Re: New "raw" COPY format - Mailing list pgsql-hackers
From | Jacob Champion |
---|---|
Subject | Re: New "raw" COPY format |
Date | |
Msg-id | CAOYmi+=4trybU1sUOTxfZE43eWcQTq=-RLMaUSgeeX2404GiUQ@mail.gmail.com |
In response to | Re: New "raw" COPY format ("Joel Jacobson" <joel@compiler.org>) |
Responses | Re: New "raw" COPY format |
List | pgsql-hackers |
On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:
>
> However, I thinking rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...

> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.

> It basically just learns the newline sequence based on the first
> occurrence, and then require it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though, at the cost of complicating things by overloading the meaning
> of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.

True.

> What I found appealing with the idea of a new COPY format,
> was that instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

--Jacob
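To make the round-trip concern concrete, here is a minimal Python sketch of the two behaviors discussed in this message: learning the newline sequence from its first occurrence ("do what you think I mean") versus an explicit, possibly multi-byte delimiter ("do what I say"). It is not the patch's code and not CopyReadLineText(); the names `detect_newline`, `split_rows`, and the `delimiter` parameter are hypothetical, and the mixed-ending check is a simplification of what the server actually does.

```python
from typing import List, Optional


def detect_newline(data: bytes) -> Optional[bytes]:
    """Return the first newline sequence seen in data, or None if there is none."""
    for i, byte in enumerate(data):
        if byte == 0x0A:  # '\n'
            return b"\n"
        if byte == 0x0D:  # '\r', possibly the start of '\r\n'
            if data[i + 1 : i + 2] == b"\n":
                return b"\r\n"
            return b"\r"
    return None


def split_rows(data: bytes, delimiter: Optional[bytes] = None) -> List[bytes]:
    """Split single-column input into rows (hypothetical helper).

    delimiter=None  -> learn the newline from its first occurrence and
                       reject any other line-ending byte in the data.
    delimiter=b"..." -> split on exactly that byte sequence, no detection.
    """
    if not data:
        return []
    if delimiter is None:
        learned = detect_newline(data)
        if learned is None:
            return [data]  # no line ending at all: a single row
        rows = data.split(learned)
        for row in rows:
            # Any leftover '\r' or '\n' means the endings are "mixed".
            if b"\r" in row or b"\n" in row:
                raise ValueError("mixed line endings")
    else:
        if delimiter == b"":
            raise ValueError("delimiter must not be empty")
        rows = data.split(delimiter)
    if rows[-1] == b"":
        rows.pop()  # a trailing delimiter does not start a new row
    return rows
```

In this sketch, a dump whose rows end in '\n' but whose values contain a bare '\r' (for example `b"a\rb\nc\n"`) is rejected under detection, because the '\r' is seen first and learned as the line ending. Passing `delimiter=b"\n"` splits the same input into two rows, and a multi-byte value such as `b"\r\n"` covers the Windows case.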