Idea for making COPY data Microsoft-proof - Mailing list pgsql-hackers

From Tom Lane
Subject Idea for making COPY data Microsoft-proof
Date
Msg-id 25767.1013394173@sss.pgh.pa.us
Whole thread Raw
Responses Re: Idea for making COPY data Microsoft-proof  (Philip Warner <pjw@rhyme.com.au>)
List pgsql-hackers
Okay, I think I've seen one too many reports of corrupted pg_dump data
caused by Microslothish newline-to-CR/LF translation.  We need to find
a way to make COPY data proof against that brain damage.

The best idea that comes to mind is as follows:

1. During COPY OUT, convert carriage returns and newlines in
data being copied out to the two-character sequences \ r and \ n.
(This will already work properly with the existing COPY IN logic,
since it recognizes those escape sequences.)

2. During COPY IN, discard \r if it appears just before \n.  (This
must occur before de-backslashing, so that the sequence \ \r \n
will be interpreted as a quoted newline.  That way, an existing
dump file that represents a data newline as \ \n will be correctly
read even if it's been Microsoft-munged to \ \r \n.)

The second part of this would have a small backwards-compatibility
problem: when reading an old dump file (one made by COPY OUT before
the above change) it would be possible for it to discard a \r that
is legitimately part of the data.  That would only happen if the \r
was the last real data character on the line.  Notice that if the
\r precedes a data \n character, the \n will be represented as \ \n
by existing COPY OUT, or by \ n by COPY OUT with the above change;
either way, the \r is followed by a backslash and will not be dropped.
So this change to COPY IN would not cause it to fail with old data
values containing \r\n sequences, only with those ending with a bare \r.

We could provide a workaround for that case by offering a SET variable
that enables or disables discarding of \r, though I'd want it to default
to ON.

If this seems like a reasonable approach, then I'd like to apply the
COPY OUT part of the change immediately (ie, for 7.2.1).  Converting \r
and \n to \ r and \ n will not hurt anyone in either forward or backward
direction, and if we do that then dumps made with 7.2.1 or later will
work correctly with 7.3 regardless of how one sets the discard-\r
variable.

A stronger change would make COPY IN regard \n, \r \n, or \r as
equivalent representations of newline.  If we do this then we'll also
prevent Unix-to-Mac newline conversions from breaking COPY data (Mac
represents newlines by just \r).  However, in this approach any
old-style data \r would break, not only those at end of data line.
So the SET variable to revert to the old COPY IN behavior would be
needed in many more cases.

Comments?  Anyone see a better way?
        regards, tom lane


pgsql-hackers by date:

Previous
From: Brent Verner
Date:
Subject: Re: RTLD_LAZY considered harmful (Re: pltlc and pltlcu
Next
From: Tom Lane
Date:
Subject: Re: RTLD_LAZY considered harmful (Re: pltlc and pltlcu