Thread: Idea for making COPY data Microsoft-proof

Idea for making COPY data Microsoft-proof

From
Tom Lane
Date:
Okay, I think I've seen one too many reports of corrupted pg_dump data
caused by Microslothish newline-to-CR/LF translation.  We need to find
a way to make COPY data proof against that brain damage.

The best idea that comes to mind is as follows:

1. During COPY OUT, convert carriage returns and newlines in
data being copied out to the two-character sequences \ r and \ n.
(This will already work properly with the existing COPY IN logic,
since it recognizes those escape sequences.)

2. During COPY IN, discard \r if it appears just before \n.  (This
must occur before de-backslashing, so that the sequence \ \r \n
will be interpreted as a quoted newline.  That way, an existing
dump file that represents a data newline as \ \n will be correctly
read even if it's been Microsoft-munged to \ \r \n.)

The second part of this would have a small backwards-compatibility
problem: when reading an old dump file (one made by COPY OUT before
the above change) it would be possible for it to discard a \r that
is legitimately part of the data.  That would only happen if the \r
was the last real data character on the line.  Notice that if the
\r precedes a data \n character, the \n will be represented as \ \n
by existing COPY OUT, or by \ n by COPY OUT with the above change;
either way, the \r is followed by a backslash and will not be dropped.
So this change to COPY IN would not cause it to fail with old data
values containing \r\n sequences, only with those ending with a bare \r.

We could provide a workaround for that case by offering a SET variable
that enables or disables discarding of \r, though I'd want it to default
to ON.

If this seems like a reasonable approach, then I'd like to apply the
COPY OUT part of the change immediately (ie, for 7.2.1).  Converting \r
and \n to \ r and \ n will not hurt anyone in either forward or backward
direction, and if we do that then dumps made with 7.2.1 or later will
work correctly with 7.3 regardless of how one sets the discard-\r
variable.

A stronger change would make COPY IN regard \n, \r \n, or \r as
equivalent representations of newline.  If we do this then we'll also
prevent Unix-to-Mac newline conversions from breaking COPY data (Mac
represents newlines by just \r).  However, in this approach any
old-style data \r would break, not only those at end of data line.
So the SET variable to revert to the old COPY IN behavior would be
needed in many more cases.

Comments?  Anyone see a better way?
        regards, tom lane


Re: Idea for making COPY data Microsoft-proof

From
Philip Warner
Date:
At 09:22 PM 2/10/02 -0500, Tom Lane wrote:
>
>Comments?  Anyone see a better way?
>

Can you do something akin to what you did with the binary output - but in
this case allow for no details. This would be *great* for column order
problems. eg,
   COPY <table> TO STDOUT WITH HEADER;

or somesuch. This would result in two extra lines at the start:
  <Keywords indocating assumptions and translations (eg. CRLF_TO_LF) etc.  {<col name>...}  ...NORMAL DATA HERE...

Similary,
   COPY <table> FROM STDIN WITH HEADER;

Would use the header to work out how to translate CRLF etc, as well as
allow COPY to be used to load different table definitions.

How does this sound?


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Idea for making COPY data Microsoft-proof

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> Can you do something akin to what you did with the binary output - but in
> this case allow for no details.

This strikes me as solving an entirely different issue -- with great
loss of backwards compatibility.  Possibly these are good ideas, but for
the moment I'd like to keep this thread focused on the issue of coping
with newline translations.

(In any case, I thought someone was already working on an optional
column-name-list clause for COPY, which would solve that problem in what
seems a cleaner fashion.)
        regards, tom lane


Re: Idea for making COPY data Microsoft-proof

From
Philip Warner
Date:
At 11:11 AM 2/11/02 -0500, Tom Lane wrote:
>Philip Warner <pjw@rhyme.com.au> writes:
>This strikes me as solving an entirely different issue 

Well, a related issue: you are talking about (slightly) changing the
encoding of dumped data - why not therefor allow for an (optional) header
with full encoding details. Column headers are just a minor bonus.


>- with great
>loss of backwards compatibility.

Not if anything without the 'WITH HEADERS' is treated as per current.
Anyone having M$ problems can use the new format (and pg_dump could use it
always).


>the moment I'd like to keep this thread focused on the issue of coping
>with newline translations.

No big deal, but this seems to be an encoding issue; and it seems like a
good idea to formalize it somehow.


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Idea for making COPY data Microsoft-proof

From
Philip Warner
Date:
At 07:02 PM 2/11/02 -0500, Tom Lane wrote:
>To take just one problem: how do I know that
>the first line is metadata, and not data that happens to look exactly
>like whatever my metadata layout is?

You don't, which is why you need the 'WITH HEADER' or 'WITH ENCODING'
clause on COPY. I guess COPY could issue a warning when you do not say WITH
HEADER and it looks like a valid header. 

Other than that, it's a case of storing information about the dumped data,
not the database schema in the data file. I'm not particularly attached to
the column names being there, but it does seem usefull to store
instructions indicating the the file is formatted.



----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.B.N. 75 008 659 498)          |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|                                |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/


Re: Idea for making COPY data Microsoft-proof

From
Tom Lane
Date:
Philip Warner <pjw@rhyme.com.au> writes:
> No big deal, but this seems to be an encoding issue; and it seems like a
> good idea to formalize it somehow.

Well, currently there is a strict separation between COPY data (in the
file) and metadata (supplied as parameters to the COPY command).  I'm
not eager to revisit that decision.  What you seem to be suggesting is
shoving metadata into the data file, but I think that will create more
problems than it solves.  To take just one problem: how do I know that
the first line is metadata, and not data that happens to look exactly
like whatever my metadata layout is?
        regards, tom lane


Re: Idea for making COPY data Microsoft-proof

From
Brent Verner
Date:
[2002-02-11 11:11] Tom Lane said:
| Philip Warner <pjw@rhyme.com.au> writes:
| > Can you do something akin to what you did with the binary output - but in
| > this case allow for no details.
| 
| This strikes me as solving an entirely different issue -- with great
| loss of backwards compatibility.  Possibly these are good ideas, but for
| the moment I'd like to keep this thread focused on the issue of coping
| with newline translations.
| 
| (In any case, I thought someone was already working on an optional
| column-name-list clause for COPY, which would solve that problem in what
| seems a cleaner fashion.)

Yes, the work for a column list in COPY FROM is largely done.  I've
not been able to work on COPY TO, tho.

Part #1 of your original proposal is certainly the right thing to do.

I've backgrounded this problem for most of the day, and although I
know it's a severe change, your "stronger" solution seems like a
better change than the part #2, which just feels like something 
that would only be undone later.  Both #2 and the "stronger" way
will require a SET option for absolute correctness; why not require
that SETting more often than not for any old-format dumps?  Yes, it
will affect a larger number of users, but we net a better dump 
format for that pain.

cheers. brent

-- 
"Develop your talent, man, and leave the world something. Records are 
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing."  -- Duane Allman