Thread: Idea for making COPY data Microsoft-proof
Okay, I think I've seen one too many reports of corrupted pg_dump data caused by Microslothish newline-to-CR/LF translation. We need to find a way to make COPY data proof against that brain damage. The best idea that comes to mind is as follows: 1. During COPY OUT, convert carriage returns and newlines in data being copied out to the two-character sequences \ r and \ n. (This will already work properly with the existing COPY IN logic, since it recognizes those escape sequences.) 2. During COPY IN, discard \r if it appears just before \n. (This must occur before de-backslashing, so that the sequence \ \r \n will be interpreted as a quoted newline. That way, an existing dump file that represents a data newline as \ \n will be correctly read even if it's been Microsoft-munged to \ \r \n.) The second part of this would have a small backwards-compatibility problem: when reading an old dump file (one made by COPY OUT before the above change) it would be possible for it to discard a \r that is legitimately part of the data. That would only happen if the \r was the last real data character on the line. Notice that if the \r precedes a data \n character, the \n will be represented as \ \n by existing COPY OUT, or by \ n by COPY OUT with the above change; either way, the \r is followed by a backslash and will not be dropped. So this change to COPY IN would not cause it to fail with old data values containing \r\n sequences, only with those ending with a bare \r. We could provide a workaround for that case by offering a SET variable that enables or disables discarding of \r, though I'd want it to default to ON. If this seems like a reasonable approach, then I'd like to apply the COPY OUT part of the change immediately (ie, for 7.2.1). Converting \r and \n to \ r and \ n will not hurt anyone in either forward or backward direction, and if we do that then dumps made with 7.2.1 or later will work correctly with 7.3 regardless of how one sets the discard-\r variable. A stronger change would make COPY IN regard \n, \r \n, or \r as equivalent representations of newline. If we do this then we'll also prevent Unix-to-Mac newline conversions from breaking COPY data (Mac represents newlines by just \r). However, in this approach any old-style data \r would break, not only those at end of data line. So the SET variable to revert to the old COPY IN behavior would be needed in many more cases. Comments? Anyone see a better way? regards, tom lane
At 09:22 PM 2/10/02 -0500, Tom Lane wrote: > >Comments? Anyone see a better way? > Can you do something akin to what you did with the binary output - but in this case allow for no details. This would be *great* for column order problems. eg, COPY <table> TO STDOUT WITH HEADER; or somesuch. This would result in two extra lines at the start: <Keywords indocating assumptions and translations (eg. CRLF_TO_LF) etc. {<col name>...} ...NORMAL DATA HERE... Similary, COPY <table> FROM STDIN WITH HEADER; Would use the header to work out how to translate CRLF etc, as well as allow COPY to be used to load different table definitions. How does this sound? ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > Can you do something akin to what you did with the binary output - but in > this case allow for no details. This strikes me as solving an entirely different issue -- with great loss of backwards compatibility. Possibly these are good ideas, but for the moment I'd like to keep this thread focused on the issue of coping with newline translations. (In any case, I thought someone was already working on an optional column-name-list clause for COPY, which would solve that problem in what seems a cleaner fashion.) regards, tom lane
At 11:11 AM 2/11/02 -0500, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >This strikes me as solving an entirely different issue Well, a related issue: you are talking about (slightly) changing the encoding of dumped data - why not therefor allow for an (optional) header with full encoding details. Column headers are just a minor bonus. >- with great >loss of backwards compatibility. Not if anything without the 'WITH HEADERS' is treated as per current. Anyone having M$ problems can use the new format (and pg_dump could use it always). >the moment I'd like to keep this thread focused on the issue of coping >with newline translations. No big deal, but this seems to be an encoding issue; and it seems like a good idea to formalize it somehow. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
At 07:02 PM 2/11/02 -0500, Tom Lane wrote: >To take just one problem: how do I know that >the first line is metadata, and not data that happens to look exactly >like whatever my metadata layout is? You don't, which is why you need the 'WITH HEADER' or 'WITH ENCODING' clause on COPY. I guess COPY could issue a warning when you do not say WITH HEADER and it looks like a valid header. Other than that, it's a case of storing information about the dumped data, not the database schema in the data file. I'm not particularly attached to the column names being there, but it does seem usefull to store instructions indicating the the file is formatted. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: > No big deal, but this seems to be an encoding issue; and it seems like a > good idea to formalize it somehow. Well, currently there is a strict separation between COPY data (in the file) and metadata (supplied as parameters to the COPY command). I'm not eager to revisit that decision. What you seem to be suggesting is shoving metadata into the data file, but I think that will create more problems than it solves. To take just one problem: how do I know that the first line is metadata, and not data that happens to look exactly like whatever my metadata layout is? regards, tom lane
[2002-02-11 11:11] Tom Lane said: | Philip Warner <pjw@rhyme.com.au> writes: | > Can you do something akin to what you did with the binary output - but in | > this case allow for no details. | | This strikes me as solving an entirely different issue -- with great | loss of backwards compatibility. Possibly these are good ideas, but for | the moment I'd like to keep this thread focused on the issue of coping | with newline translations. | | (In any case, I thought someone was already working on an optional | column-name-list clause for COPY, which would solve that problem in what | seems a cleaner fashion.) Yes, the work for a column list in COPY FROM is largely done. I've not been able to work on COPY TO, tho. Part #1 of your original proposal is certainly the right thing to do. I've backgrounded this problem for most of the day, and although I know it's a severe change, your "stronger" solution seems like a better change than the part #2, which just feels like something that would only be undone later. Both #2 and the "stronger" way will require a SET option for absolute correctness; why not require that SETting more often than not for any old-format dumps? Yes, it will affect a larger number of users, but we net a better dump format for that pain. cheers. brent -- "Develop your talent, man, and leave the world something. Records are really gifts from people. To think that an artist would love you enough to share his music with anyone is a beautiful thing." -- Duane Allman