Thread: backslash-dot quoting in COPY CSV
Hi, The doc on COPY CSV says about the backslash-dot sequence: To avoid any misinterpretation, a \. data value appearing as a lone entry on a line is automatically quoted on output, and on input, if quoted, is not interpreted as the end-of-data marker However this quoting does not happen when \. is already part of a quoted field. Example: COPY (select 'somevalue', E'foo\n\\.\nbar') TO STDOUT CSV; outputs: somevalue,"foo \. bar" which conforms to the CSV rules, by which we are not allowed to replace \. by anything AFAICS. The trouble is, when trying to import this back with COPY FROM, it will error out at the backslash-dot and not import anything. Furthermore, if these data are meant to be embedded into a script, it creates a potential risk of SQL injection. It is a known issue? I haven't found previous discussions on this. It looks to me like the ability of backslash-dot to be an end-of-data marker should be neutralizable for CSV. When the data is not embedded, it's not needed anyway, and when it's embedded, we could surely think of alternatives. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Wed, Jan 2, 2019 at 04:58:35PM +0100, Daniel Verite wrote: > Hi, > > The doc on COPY CSV says about the backslash-dot sequence: > > To avoid any misinterpretation, a \. data value appearing as a > lone entry on a line is automatically quoted on output, and on > input, if quoted, is not interpreted as the end-of-data marker > > However this quoting does not happen when \. is already part > of a quoted field. Example: > > COPY (select 'somevalue', E'foo\n\\.\nbar') TO STDOUT CSV; > > outputs: > > somevalue,"foo > \. > bar" > > which conforms to the CSV rules, by which we are not allowed > to replace \. by anything AFAICS. > The trouble is, when trying to import this back with COPY FROM, > it will error out at the backslash-dot and not import anything. > Furthermore, if these data are meant to be embedded into a > script, it creates a potential risk of SQL injection. > > It is a known issue? I haven't found previous discussions on this. > It looks to me like the ability of backslash-dot to be an end-of-data > marker should be neutralizable for CSV. When the data is not embedded, > it's not needed anyway, and when it's embedded, we could surely think > of alternatives. I was unable to reproduce the failure here using files: CREATE TABLE test (x TEXT); INSERT INTO test VALUES (E'foo\n\\.\nbar'); COPY test TO STDOUT CSV; "foo \. bar" COPY test TO '/u/postgres/tmp/x' CSV; COPY test FROM '/u/postgres/tmp/x' CSV; SELECT * FROM test; x ----- foo+ \. + bar foo+ \. + bar but I am able to see the failure using STDIN: COPY test FROM STDIN CSV; Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. "foo \. ERROR: unterminated CSV quoted field CONTEXT: COPY test, line 1: ""foo This seems like a bug to me. Looking at the code, psql issues the prompts for STDIN, but when it sees \. alone on a line, it has no idea you are in a quoted CSV string, so it thinks the copy is done and sends the result to the server. I can't see an easy way to fix this. I guess we could document it. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian wrote: > but I am able to see the failure using STDIN: > > COPY test FROM STDIN CSV; > Enter data to be copied followed by a newline. > End with a backslash and a period on a line by itself, or an EOF > signal. > "foo > \. > ERROR: unterminated CSV quoted field > CONTEXT: COPY test, line 1: ""foo > > This seems like a bug to me. Looking at the code, psql issues the > prompts for STDIN, but when it sees \. alone on a line, it has no idea > you are in a quoted CSV string, so it thinks the copy is done and sends > the result to the server. I can't see an easy way to fix this. I guess > we could document it. Thanks for looking into this. \copy from file with csv is also affected since it uses COPY FROM STDIN behind the scene. The case of embedded data looks more worrying because psql will execute the data following \. as if they were SQL statements. ISTM that only ON_ERROR_STOP=on prevents the risk of SQL injection in that scenario, but it's off by default. What about this idea: when psql is feeding COPY data from its command stream and an error occurs, it should act as if ON_ERROR_STOP was "on" even if it's not. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Thu, Jan 24, 2019 at 10:09:30PM -0500, Bruce Momjian wrote: > This seems like a bug to me. Looking at the code, psql issues the > prompts for STDIN, but when it sees \. alone on a line, it has no idea > you are in a quoted CSV string, so it thinks the copy is done and sends > the result to the server. I can't see an easy way to fix this. I guess > we could document it. In src/bin/psql/copy.c, handleCopyIn(): /* * This code erroneously assumes '\.' on a line alone * inside a quoted CSV string terminates the \copy. * http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org */ if (strcmp(buf, "\\.\n") == 0 || strcmp(buf, "\\.\r\n") == 0) { copydone = true; break; } This story pops up from time to time.. -- Michael
Attachment
Michael Paquier wrote: > In src/bin/psql/copy.c, handleCopyIn(): > > /* > * This code erroneously assumes '\.' on a line alone > * inside a quoted CSV string terminates the \copy. > * > http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org > */ > if (strcmp(buf, "\\.\n") == 0 || > strcmp(buf, "\\.\r\n") == 0) > { > copydone = true; > break; > } Indeed, it's exactly that problem. And there's the related problem that it derails the input stream in a way that lines of data become commands, but that one is not specific to that particular error. For the backslash-dot in a quoted string, the root cause is that psql is not aware that the contents are CSV so it can't parse them properly. I can think of several ways of working around that, more or less inelegant: - the end of data could be expressed as a length (in number of lines for instance) instead of an in-data marker. - the end of data could be configurable, as in the MIME structure of multipart mail messages, where a part is ended by a "boundary", line, generally a long randomly generated string. This boundary would have to be known to psql through setting a dedicated variable or command. - COPY as the SQL command could have the boundary option for data fed through its STDIN. This could neutralize the special role of backslash-dot in general, not just in quoted fields, since the necessity to quote backslash-dot is a wart anyway. - psql could be told somehow that the next piece of inline data is in the CSV format, and then pass it through a CSV parser. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Sun, Jan 27, 2019 at 10:10:36PM +0900, Michael Paquier wrote: > On Thu, Jan 24, 2019 at 10:09:30PM -0500, Bruce Momjian wrote: > > This seems like a bug to me. Looking at the code, psql issues the > > prompts for STDIN, but when it sees \. alone on a line, it has no idea > > you are in a quoted CSV string, so it thinks the copy is done and sends > > the result to the server. I can't see an easy way to fix this. I guess > > we could document it. > > In src/bin/psql/copy.c, handleCopyIn(): > > /* > * This code erroneously assumes '\.' on a line alone > * inside a quoted CSV string terminates the \copy. > * http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org > */ > if (strcmp(buf, "\\.\n") == 0 || > strcmp(buf, "\\.\r\n") == 0) > { > copydone = true; > break; > } > > This story pops up from time to time.. The killer is I committed this C comment six years ago, and didn't remember it. :-O commit 361b94c4b98b85b19b850cff37be76d1f6d4f8f7 Author: Bruce Momjian <bruce@momjian.us> Date: Thu Jul 4 13:09:52 2013 -0400 Add C comment about \copy bug in CSV mode Comment: This code erroneously assumes '\.' on a line alone inside a quoted CSV string terminates the \copy. http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org Glad I mentioned the URL, at least. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Jan 25, 2019 at 01:01:22PM +0100, Daniel Verite wrote: > Bruce Momjian wrote: > > > but I am able to see the failure using STDIN: > > > > COPY test FROM STDIN CSV; > > Enter data to be copied followed by a newline. > > End with a backslash and a period on a line by itself, or an EOF > > signal. > > "foo > > \. > > ERROR: unterminated CSV quoted field > > CONTEXT: COPY test, line 1: ""foo > > > > This seems like a bug to me. Looking at the code, psql issues the > > prompts for STDIN, but when it sees \. alone on a line, it has no idea > > you are in a quoted CSV string, so it thinks the copy is done and sends > > the result to the server. I can't see an easy way to fix this. I guess > > we could document it. > > Thanks for looking into this. > > \copy from file with csv is also affected since it uses COPY FROM > STDIN behind the scene. The case of embedded data looks more worrying > because psql will execute the data following \. as if they were > SQL statements. > > ISTM that only ON_ERROR_STOP=on prevents the risk of SQL injection > in that scenario, but it's off by default. You are correct that someone having data that is SQL commands would be able to perhaps execute those commands on restore. pg_dump doesn't use CSV, and this only affects STDIN, not files or PROGRAM input. I think the question is how many people are using CSV/STDIN for insecure data loads? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Mon, Jan 28, 2019 at 04:06:17PM +0100, Daniel Verite wrote: > Michael Paquier wrote: > > > In src/bin/psql/copy.c, handleCopyIn(): > > > > /* > > * This code erroneously assumes '\.' on a line alone > > * inside a quoted CSV string terminates the \copy. > > * > > http://www.postgresql.org/message-id/E1TdNVQ-0001ju-GO@wrigleys.postgresql.org > > */ > > if (strcmp(buf, "\\.\n") == 0 || > > strcmp(buf, "\\.\r\n") == 0) > > { > > copydone = true; > > break; > > } > > Indeed, it's exactly that problem. > And there's the related problem that it derails the input stream > in a way that lines of data become commands, but that one is > not specific to that particular error. > > For the backslash-dot in a quoted string, the root cause is > that psql is not aware that the contents are CSV so it can't > parse them properly. > I can think of several ways of working around that, more or less > inelegant: > > - the end of data could be expressed as a length (in number of lines > for instance) instead of an in-data marker. > > - the end of data could be configurable, as in the MIME structure of > multipart mail messages, where a part is ended by a "boundary", > line, generally a long randomly generated string. This boundary > would have to be known to psql through setting a dedicated > variable or command. > > - COPY as the SQL command could have the boundary option > for data fed through its STDIN. This could neutralize the > special role of backslash-dot in general, not just in quoted fields, > since the necessity to quote backslash-dot is a wart anyway. Well, these all kind of require a change to the COPY format, which hasn't changed in many years. > - psql could be told somehow that the next piece of inline data is in > the CSV format, and then pass it through a CSV parser. That might be the cleanest solution, but how would we actually input multi-line data in CSV mode with \. alone on a line? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian wrote: > > - the end of data could be expressed as a length (in number of lines > > for instance) instead of an in-data marker. > > > > - the end of data could be configurable, as in the MIME structure of > > multipart mail messages, where a part is ended by a "boundary", > > line, generally a long randomly generated string. This boundary > > would have to be known to psql through setting a dedicated > > variable or command. > > > > - COPY as the SQL command could have the boundary option > > for data fed through its STDIN. This could neutralize the > > special role of backslash-dot in general, not just in quoted fields, > > since the necessity to quote backslash-dot is a wart anyway. > > Well, these all kind of require a change to the COPY format, which > hasn't changed in many years. Not for the first two. As an example of solution #2, it could look like this: =# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7== =# COPY table FROM STDIN CSV; somevalue,"foo \. bar" ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7== Instead of looking for \. on a line by itself, psql would look for the boundary to know where the data ends. The boundary is not transmitted to the server, it has no need to know about it. > > - psql could be told somehow that the next piece of inline data is in > > the CSV format, and then pass it through a CSV parser. > > That might be the cleanest solution, but how would we actually input > multi-line data in CSV mode with \. alone on a line? With this solution, the content doesn't change at all. The weird part would be the user interface, because the information psql needs is not only "CSV", it's also the options DELIMITER, QUOTE, ESCAPE and possibly ENCODING. Currently it doesn't know any of these, they're passed to the server in an opaque, unparsed form within the COPY command. Personally, the solution I find cleaner is the server side not having any end-of-data marker for CSV. So backslash-dot would never be special. psql could allow for a custom ending boundary for in-script data, and users could set that to backslash-dot if they want, but that would be their choice. That would be clearly not backward compatible, and I believe it wouldn't work with the v2 protocol, so I'm not sure it would have much chance of approval. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Wed, Jan 30, 2019 at 06:32:11PM +0100, Daniel Verite wrote: > Bruce Momjian wrote: > > Well, these all kind of require a change to the COPY format, which > > hasn't changed in many years. > > Not for the first two. As an example of solution #2, it could look like this: > > =# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7== > =# COPY table FROM STDIN CSV; > somevalue,"foo > \. > bar" > ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7== > > Instead of looking for \. on a line by itself, psql would look for the > boundary to know where the data ends. > The boundary is not transmitted to the server, it has no need to know > about it. Wow, that is an odd API, as you stated below. > > > - psql could be told somehow that the next piece of inline data is in > > > the CSV format, and then pass it through a CSV parser. > > > > That might be the cleanest solution, but how would we actually input > > multi-line data in CSV mode with \. alone on a line? > > With this solution, the content doesn't change at all. > The weird part would be the user interface, because the information > psql needs is not only "CSV", it's also the options DELIMITER, QUOTE, > ESCAPE and possibly ENCODING. Currently it doesn't know any of these, > they're passed to the server in an opaque, unparsed form within > the COPY command. > > Personally, the solution I find cleaner is the server side not having > any end-of-data marker for CSV. So backslash-dot would never be > special. psql could allow for a custom ending boundary for in-script > data, and users could set that to backslash-dot if they want, but that > would be their choice. > That would be clearly not backward compatible, and I believe it wouldn't > work with the v2 protocol, so I'm not sure it would have much chance of > approval. I had forgotten that the DELIMITER and QUOTE can be changed --- that kills the idea of adding a simple CSV parser into psql because we would have to parse the COPY SQL command as well. I am wondering if we should just disallow CSV from STDIN, on security grounds. How big a problem would that be for people? Would we have to disable to STDOUT as well since it could not be restored? Should we issue some kind of security warning in such cases? Should we document this? In hindsight, I am not sure how we could have designed this more securly. I guess we could have required some special text to start all CSV continuation lines that were not end-of-file, but that would have been very unportable, which is the goal of CSV. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
st 30. 1. 2019 18:51 odesílatel Bruce Momjian <bruce@momjian.us> napsal:
On Wed, Jan 30, 2019 at 06:32:11PM +0100, Daniel Verite wrote:
> Bruce Momjian wrote:
> > Well, these all kind of require a change to the COPY format, which
> > hasn't changed in many years.
>
> Not for the first two. As an example of solution #2, it could look like this:
>
> =# \set INLINE_COPY_BOUNDARY ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
> =# COPY table FROM STDIN CSV;
> somevalue,"foo
> \.
> bar"
> ==JuQW3gc2mQjXuvmJ32TlOLhJ3F2Eh2LcsBup0oH7==
>
> Instead of looking for \. on a line by itself, psql would look for the
> boundary to know where the data ends.
> The boundary is not transmitted to the server, it has no need to know
> about it.
Wow, that is an odd API, as you stated below.
> > > - psql could be told somehow that the next piece of inline data is in
> > > the CSV format, and then pass it through a CSV parser.
> >
> > That might be the cleanest solution, but how would we actually input
> > multi-line data in CSV mode with \. alone on a line?
>
> With this solution, the content doesn't change at all.
> The weird part would be the user interface, because the information
> psql needs is not only "CSV", it's also the options DELIMITER, QUOTE,
> ESCAPE and possibly ENCODING. Currently it doesn't know any of these,
> they're passed to the server in an opaque, unparsed form within
> the COPY command.
>
> Personally, the solution I find cleaner is the server side not having
> any end-of-data marker for CSV. So backslash-dot would never be
> special. psql could allow for a custom ending boundary for in-script
> data, and users could set that to backslash-dot if they want, but that
> would be their choice.
> That would be clearly not backward compatible, and I believe it wouldn't
> work with the v2 protocol, so I'm not sure it would have much chance of
> approval.
I had forgotten that the DELIMITER and QUOTE can be changed --- that
kills the idea of adding a simple CSV parser into psql because we would
have to parse the COPY SQL command as well.
I am wondering if we should just disallow CSV from STDIN, on security
grounds. How big a problem would that be for people? Would we have to
disable to STDOUT as well since it could not be restored? Should we
issue some kind of security warning in such cases? Should we document
this?
it is pretty common pattern for etl, copy from stdin. I am thinking it can be big problem
In hindsight, I am not sure how we could have designed this more
securly. I guess we could have required some special text to start all
CSV continuation lines that were not end-of-file, but that would have
been very unportable, which is the goal of CSV.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Pavel Stehule <pavel.stehule@gmail.com> writes: > st 30. 1. 2019 18:51 odesílatel Bruce Momjian <bruce@momjian.us> napsal: >> I am wondering if we should just disallow CSV from STDIN, on security >> grounds. How big a problem would that be for people? Would we have to >> disable to STDOUT as well since it could not be restored? Should we >> issue some kind of security warning in such cases? Should we document >> this? > it is pretty common pattern for etl, copy from stdin. I am thinking it can > be big problem Given how long we've had COPY CSV support, and the tiny number of complaints to date, I do not think it's something to panic over. Disallowing the functionality altogether is surely an overreaction. I don't really see an argument for calling it a security problem, given that pg_dump doesn't use CSV and it isn't the default for anything else either. Sure, you can imagine some bad actor hoping to cause problems by putting crafted data into a table, but how does that data end up in a script that's using COPY CSV FROM STDIN (as opposed to copying out-of-line data)? It's a bit far-fetched. A documentation warning might be the appropriate response. I don't see any plausible way for psql to actually fix the problem, short of a protocol change to allow the backend to tell it how the data stream is going to be parsed. regards, tom lane
On Wed, Jan 30, 2019 at 01:20:59PM -0500, Tom Lane wrote: > Given how long we've had COPY CSV support, and the tiny number of > complaints to date, I do not think it's something to panic over. > Disallowing the functionality altogether is surely an overreaction. > > A documentation warning might be the appropriate response. I don't > see any plausible way for psql to actually fix the problem, short > of a protocol change to allow the backend to tell it how the data > stream is going to be parsed. Yes, agreed. I looked at this problem a couple of months (year(s)?) ago and gave up on designing a clear portable solution after a couple of hours over it, and that's quite a corner case. -- Michael