Re: New "raw" COPY format - Mailing list pgsql-hackers

From Jacob Champion
Subject Re: New "raw" COPY format
Date
Msg-id CAOYmi+=4trybU1sUOTxfZE43eWcQTq=-RLMaUSgeeX2404GiUQ@mail.gmail.com
In response to Re: New "raw" COPY format  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: New "raw" COPY format
List pgsql-hackers
On Tue, Oct 15, 2024 at 1:38 PM Joel Jacobson <joel@compiler.org> wrote:
>
> However, I think rejecting such column data seems like the
> better alternative, to ensure data exported with COPY TO
> can always be imported back using COPY FROM,
> for the same format. If text column data contains newlines,
> users probably ought to be using the text or csv format instead.

Yeah. I think _someone's_ going to have strong opinions one way or the
other, but that person is not me. And I assume a contents check during
COPY TO is going to have a noticeable performance impact...
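
For illustration only, here is a rough Python sketch of what such a contents check amounts to. It is not the patch's code (the backend is C), the function names are hypothetical, and treating both '\r' and '\n' as rejectable is an assumption. The point is that every value has to be scanned before it is written, which is where the extra cost would come from.

```python
def check_round_trippable(values):
    """Raise ValueError if any value contains a line-ending character.

    Hypothetical helper: assumes both '\r' and '\n' must be rejected.
    """
    for i, value in enumerate(values):
        # Each value is scanned in full before anything is written out.
        if "\n" in value or "\r" in value:
            raise ValueError(
                f"row {i}: value contains a line ending; "
                "use the text or csv format instead"
            )
    return values


def export_single_column(values, newline="\n"):
    """Emit one value per line, refusing data that could not be re-imported."""
    check_round_trippable(values)
    return "".join(value + newline for value in values)
```

With this sketch, export_single_column(["a", "b"]) produces two lines, while a value such as "a\nb" is rejected up front instead of silently becoming two rows on re-import.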

> > - RAW seems like an okay-ish label, but for something that's doing as
> > much magic end-of-line detection as this patch is, I'd personally
> > prefer SINGLE (as in, "single column").
>
> It's actually the same end-of-line detection as the text format
> in copyfromparse.c's CopyReadLineText(), except the code
> is simpler thanks to not having to deal with quotes or escapes.

Right, sorry, I hadn't meant to imply that you made it up. :D Just
that a "raw" format that is actually automagically detecting things
doesn't seem very "raw" to me, so I prefer the other name.

> It basically just learns the newline sequence based on the first
> occurrence, and then requires it to be the same throughout the file.

A hypothetical type whose text representation can contain '\r' but not
'\n' still can't be unambiguously round-tripped under this scheme:
COPY FROM will see the "mixed" line endings and complain, even though
there's no ambiguity.
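
To make that case concrete, here is a simplified Python sketch of "learn the newline from its first occurrence, then require it throughout". It is only a model of the behavior described above, not the logic in CopyReadLineText(), and the function name is hypothetical.

```python
import re

# Candidate line endings; "\r\n" is listed first so it wins over a bare "\r".
_EOL = re.compile(r"\r\n|\r|\n")


def split_learning_newline(data):
    """Split data into lines, learning the newline from its first occurrence.

    Raises ValueError if a different line ending shows up later.
    """
    lines = []
    learned = None
    pos = 0
    for match in _EOL.finditer(data):
        eol = match.group()
        if learned is None:
            learned = eol  # first line ending seen decides the format
        elif eol != learned:
            raise ValueError(
                f"mixed line endings: expected {learned!r}, got {eol!r}"
            )
        lines.append(data[pos:match.start()])
        pos = match.end()
    if pos < len(data):
        lines.append(data[pos:])  # final line without a trailing newline
    return lines
```

Feeding it "x\ry\nz\n" (the values "x\ry" and "z" written with '\n' as the separator) raises the "mixed line endings" error, because the '\r' inside the first value is seen first and taken to be the line ending.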

Maybe no one will run into that problem in practice? But if they did,
I think that'd be a pretty frustrating limitation. It'd be nice to
override the behavior, to change it from "do what you think I mean" to
"do what I say".

> > - Speaking of magic end-of-line detection, can there be a way to turn
> > that off? Say, via DELIMITER?
> > - Generic DELIMITER support, for any single-byte separator at all,
> > might make a "single-column" format more generally applicable. But I
> > might be over-architecting. And it would make the COPY TO issue even
> > worse...
>
> That's an interesting idea that would provide more flexibility,
> though, at the cost of complicating things by overloading the meaning
> of DELIMITER.

I think that'd be a docs issue rather than a conceptual one, though...
it's still a delimiter. I wouldn't really expect end-user confusion.

> If aiming to make this more generally applicable,
> then at least DELIMITER would need to be multi-byte,
> since otherwise the Windows case \r\n couldn't be specified.

True.
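
A minimal Python sketch of the "do what I say" variant, assuming the delimiter is given explicitly and may be more than one byte. The function name is hypothetical and nothing here reflects an actual COPY option; it only shows that with an explicit delimiter there is nothing to detect.

```python
def split_with_delimiter(data, delimiter):
    """Split data on an explicit, possibly multi-character delimiter.

    No detection is performed: only the exact delimiter ends a value.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    parts = data.split(delimiter)
    # A trailing delimiter terminates the last value; it does not start a new one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

Under these assumptions, split_with_delimiter("x\ry\nz\n", "\n") keeps the '\r' inside the first value, and "\r\n" works as a delimiter for Windows-style input.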

> What I found appealing with the idea of a new COPY format,
> was that instead of overloading the existing options
> with more complexity, a new format wouldn't need to affect
> the existing options, and the new format could be explained
> separately, without making things worse for users not
> using this format.

I agree that we should not touch the existing formats. If
RAW/SINGLE/whatever needed a multibyte line delimiter, I'm not
proposing that the other formats should change.

--Jacob


