Thread: Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

From
Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> Fix by adjusting handleCopyIn() to use the binary code path also when the copy
> source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
> copies.

That seems like a hack, as it also changes the behavior w.r.t.
prompts and EOF-mark detection, neither for the better.

I'm not really convinced that we need to do anything at all
about this.

            regards, tom lane



Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

From
"Joel Jacobson"
Date:
On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> Fix by adjusting handleCopyIn() to use the binary code path also when the copy
>> source is a file (i.e., copystream != pset.cur_cmd_source), even in textual
>> copies.
>
> That seems like a hack, as it also changes the behavior w.r.t.
> prompts and EOF-mark detection, neither for the better.

Hmm, in what way does it change the behavior?
I ran almost all the tests, including the TAP ones, and none of them failed with
this fix. Is there some behavior that is currently not covered by the test suite?

> I'm not really convinced that we need to do anything at all
> about this.

I think this should be fixed, because 0x00 is valid UTF-8 for the Unicode
Character 'NULL' (U+0000), which makes me think there is a risk CSV files
coming from other systems than PostgreSQL, could contain such text strings
which would then silently cause data corruption.

/Joel



Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

From
Tom Lane
Date:
"Joel Jacobson" <joel@compiler.org> writes:
> On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:
>> That seems like a hack, as it also changes the behavior w.r.t.
>> prompts and EOF-mark detection, neither for the better.

> Hmm, in what way does it change the behavior?
> I ran almost all the tests, including the TAP ones, and none of them failed with
> this fix. Is there some behavior that is currently not covered by the test suite?

Of course.  (Our regression tests are very far from covering 100% of
the behavioral space.)  In the case at hand, this would break the
expected line-by-line prompting for "\copy ... from '/dev/tty'".

Playing with this just now, I notice that the prompt you get
still claims that \. works to end input, although right now
it does not:

regression=# \copy int8_tbl from '/dev/tty'
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1    3
>> 5    6
>> \.
>> ^D
COPY 2

I'm inclined to think that the prompt still describes what should
happen, at least in non-binary mode, although in binary mode we
probably ought to just say (and do) "End with an EOF signal".

So perhaps the if-test to choose the code path could be

    if ((isbinary || copystream != pset.cur_cmd_source) && !showprompt)

which would allow dropping the vestigial prompt logic in the first
path, and we would also need to change the test in the second path
that decides if we should check for \.  (Likely this should be
refactored a bit to make it more understandable.  An intermediate
flag saying whether we intend to check for \. might help.)

Anyway, my point is that we need to think through the desirable
behavior for each possible combination of showprompt, isbinary, and
copystream != pset.cur_cmd_source, because all 8 cases are reachable.

            regards, tom lane



Re: [BUG] psql: Make \copy from 'text' and 'csv' formats fail on NUL bytes

From
"Joel Jacobson"
Date:
On Sun, Nov 10, 2024, at 23:14, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> On Sun, Nov 10, 2024, at 22:37, Tom Lane wrote:
>>> That seems like a hack, as it also changes the behavior w.r.t.
>>> prompts and EOF-mark detection, neither for the better.
>
>> Hmm, in what way does it change the behavior?
>> I ran almost all the tests, including the TAP ones, and none of them failed with
>> this fix. Is there some behavior that is currently not covered by the test suite?
>
> Of course.  (Our regression tests are very far from covering 100% of
> the behavioral space.)
>
> In the case at hand, this would break the
> expected line-by-line prompting for "\copy ... from '/dev/tty'".

Yes, of course, just wondered what kind of behavior in this area
that wasn't tested, thanks for explaining it.

> Playing with this just now, I notice that the prompt you get
> still claims that \. works to end input, although right now
> it does not:
>
> regression=# \copy int8_tbl from '/dev/tty'
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
>>> 1    3
>>> 5    6
>>> \.
>>> ^D
> COPY 2
>
> I'm inclined to think that the prompt still describes what should
> happen, at least in non-binary mode, although in binary mode we
> probably ought to just say (and do) "End with an EOF signal".
>
> So perhaps the if-test to choose the code path could be
>
>     if ((isbinary || copystream != pset.cur_cmd_source) && !showprompt)

Thanks for guidance, that seems to fix, for 6/8 cases I've figured out how to test.

> which would allow dropping the vestigial prompt logic in the first
> path,

I assume you mean that due to "&& !showprompt" the "if (showprompt)"
becomes unreachable and can therefore be dropped?

> and we would also need to change the test in the second path
> that decides if we should check for \.  (Likely this should be
> refactored a bit to make it more understandable.  An intermediate
> flag saying whether we intend to check for \. might help.)

Maybe check_dot_command?

    const bool    check_dot_command = (copystream == pset.cur_cmd_source);

I haven't tried yet to refactor the code,
except than replacing the two "copystream == pset.cur_cmd_source"
occurrences with the new check_dot_command flag.

First want to understand if the two remaining cases are valid,
and if they can be tested:

> Anyway, my point is that we need to think through the desirable
> behavior for each possible combination of showprompt, isbinary, and
> copystream != pset.cur_cmd_source, because all 8 cases are reachable.

I guess these are the 8 cases?

+--------+-------------+----------+------------------+
| CASE   | showprompt  | isbinary | check_dot_command |
+--------+-------------+----------+------------------+
|   1    |    false    |   false  |      false      |
|   2    |    false    |   false  |      true       |
|   3    |    false    |   true   |      false      |
|   4    |    false    |   true   |      true       |
|   5    |    true     |   false  |      false      |
|   6    |    true     |   false  |      true       |
|   7*   |    true     |   true   |      false      |
|   8*   |    true     |   true   |      true       |
+--------+-------------+----------+------------------+
* Cases 7 and 8 not tested yet

With the changed if-test, case 1-6 works,
and for case 1, then binary mode branch is taken
instead of the text mode branch,
whereas cases 2-6 take the same branch as before.

joel@Joels-MBP psql_tester % git diff --no-index -U100 /tmp/psql.log.HEAD /tmp/psql.log
diff --git a/tmp/psql.log.HEAD b/tmp/psql.log
index 5e44e30..1f48ac9 100644
--- a/tmp/psql.log.HEAD
+++ b/tmp/psql.log
@@ -1,6 +1,6 @@
-COPY case: 1 TEXT MODE
+COPY case: 1 BINARY MODE
 COPY case: 2 TEXT MODE
 COPY case: 3 BINARY MODE
 COPY case: 4 BINARY MODE
 COPY case: 5 TEXT MODE
 COPY case: 6 TEXT MODE

Here is how I tested each case:

# CASE 1:
# showprompt:        false
# isbinary:          false
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.data'"

# CASE 2:
# showprompt:        false
# isbinary:          false
# check_dot_command: true
psql -f /tmp/copy_stdin_text.sql

# CASE 3:
# showprompt:        false
# isbinary:          true
# check_dot_command: false
psql -c "\copy int8_tbl from '/tmp/int8_tbl.bin' (format binary)"

# CASE 4:
# showprompt:        false
# isbinary:          true
# check_dot_command: true
printf '\\copy int8_tbl from stdin (format binary)\n' >/tmp/copy_stdin_binary.sql
cat /tmp/int8_tbl.bin >>/tmp/copy_stdin_binary.sql
psql -f copy_stdin_binary.sql

# CASE 5:
# showprompt:        true
# isbinary:          false
# check_dot_command: false

psql
\copy int8_tbl from '/dev/tty'
17    18
19    20
\.
# Send EOF (Ctrl+D)

# CASE 6:
# showprompt:        true
# isbinary:          false
# check_dot_command: true

psql
\copy int8_tbl from stdin
21    22
23    24
\.

# CASE 7:
# showprompt:        true
# isbinary:          true
# check_dot_command: false

CASE 7 would be like CASE 5, with the addition of (format binary)
that is:

    \copy int8_tbl from '/dev/tty' (format binary)

but I wonder if this is a valid case? Could we really copy/paste
or by some other means give psql binary data here?
I tried to wrap psql and send the binary content of
my '/tmp/int8_tbl.bin' file, and tried sending the EOF control code,
and I see I get PGCOPY-message, but the txn isn't committed,
so not sure what's happening. Before I continue trying
to figure this one out, I just wanted to make sure this is
a valid case, and how it is supposed to be used if so?

# CASE 8:
# showprompt:        true
# isbinary:          true
# check_dot_command: true

CASE 8 would be like CASE 6, with the addition of (format binary)
that is:

    \copy int8_tbl from stdin (format binary)

I've had the same problem with this one, as with CASE 7.

Many thanks for guidance.

/Joel