Re: Fixing backslash dot for COPY FROM...CSV - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: Fixing backslash dot for COPY FROM...CSV
Msg-id 1fba50b1-604c-44f9-b6a6-a3a81e8d0bb8@manitou-mail.org
In response to Re: Fixing backslash dot for COPY FROM...CSV  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Fixing backslash dot for COPY FROM...CSV
List pgsql-hackers
Tom Lane wrote:

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)

FWIW my plan was to not change anything in the TEXT mode,
but I wasn't aware it had this issue that you found when
\. is not in a line by itself.

>  Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage.  I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.

Agreed, we don't want to document that. But why doesn't \. in the
contents simply represent a dot (as opposed to being an error),
just as \a represents a?

I mean if eofdata contains

  foobar\a
  foobaz\aother

then we get after import:
      f1
--------------
 foobara
 foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

  foobar\.
  foobar\.other

is not supposed to produce
      f1
--------------
 foobar.
 foobar.other
(2 rows)


I see these rules in [1] about backslash:

#1.
  "End of data can be represented by a single line containing just
   backslash-period (\.)."

foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.

#2.
  "Backslash characters (\) can be used in the COPY data to quote data
  characters that might otherwise be taken as row or column
  delimiters."

Dot is not a column delimiter (it's forbidden as one anyway), so #2
does not apply.

#3.
  "In particular, the following characters must be preceded by a
  backslash if they appear as part of a column value: backslash itself,
  newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
  "The following special backslash sequences are recognized by COPY
  FROM:" (followed by the table with \b, \f, ...)

Dot is not mentioned.

#5.
  "Any other backslashed character that is not mentioned in the above
  table will be taken to represent itself"

Here we say that backslash-dot represents a dot (unless other
rules apply):

  foobar\. => foobar.
  foobar\.other => foobar.other
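
To make that reading of #4 and #5 concrete, here is a minimal sketch in Python. It is not the server's code, only an illustration of what the documented rules seem to say for a single field. The name unescape_field is hypothetical, and the sketch deliberately ignores the octal (\digits) and hex (\xdigits) sequences as well as the \N null string.

```python
# Single-letter escapes from the table referenced in rule #4.
# Octal (\digits) and hex (\xdigits) sequences are deliberately left out,
# and the \N null string is not handled: this is an illustration only.
SPECIAL = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def unescape_field(raw: str) -> str:
    """Decode one text-format field per the reading of rules #4 and #5."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Rule #4: a known sequence maps to its special character.
            # Rule #5: any other backslashed character represents itself.
            out.append(SPECIAL.get(nxt, nxt))
            i += 2
        else:
            # Ordinary character (a lone trailing backslash is kept as-is).
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for line in [r"foobar\a", r"foobaz\aother", r"foobar\.", r"foobar\.other"]:
        print(f"{line!r} -> {unescape_field(line)!r}")
```

Under this reading, \. inside a value goes through exactly the same fallback branch as \a, which is the point being argued.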

#6.
  "However, beware of adding backslashes unnecessarily, since that
   might accidentally produce a string matching the end-of-data marker
   (\.) or the null string (\N by default)."

So we *recommend* not using \., but as I understand it, the warning
about the EOD marker concerns accidentally creating a line that
matches #1, that is, \. alone on a line.

#7.
  "These strings will be recognized before any other backslash
  processing is done."

TBH I don't understand what #7 implies. The order of backslash
processing looks like an implementation detail that should not
matter for understanding the format.


Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other rules (#2, #3, #4, #6, #7) do not state anything that
would contradict that.
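
As a rough sketch of that combined reading (again in Python, and again not what the server currently does), the function below treats a line consisting of exactly \. as end of data per #1 and decodes everything else per #4 and #5. The names unescape and read_single_column are hypothetical; it assumes a single-column input as in the f1 examples and skips the octal, hex and \N cases.

```python
import re

# Single-letter escapes from the table in rule #4 (octal/hex omitted).
SPECIAL = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def unescape(field: str) -> str:
    """Rule #4 for known sequences, rule #5 (represents itself) otherwise."""
    return re.sub(
        r"\\(.)",
        lambda m: SPECIAL.get(m.group(1), m.group(1)),
        field,
        flags=re.S,
    )


def read_single_column(lines):
    """Return decoded values, stopping at a line that is exactly '\\.'."""
    rows = []
    for line in lines:
        if line == "\\.":
            # Rule #1: backslash-period alone on a line ends the data.
            break
        # Any other line, including one that merely contains \., is data.
        rows.append(unescape(line))
    return rows


if __name__ == "__main__":
    data = [r"foobar\.", r"foobar\.other", "\\.", "never reached"]
    print(read_single_column(data))
```

With the sample input this yields the two rows foobar. and foobar.other, and the line after the marker is never read.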


[1] https://www.postgresql.org/docs/current/sql-copy.html


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite


