Re: New "raw" COPY format - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: New "raw" COPY format
Msg-id CAD21AoCKLesbbmMUOCtFdYy7H8M=4C6B4RH0eNCzFP0PJCv0bw@mail.gmail.com
In response to Re: New "raw" COPY format  ("Joel Jacobson" <joel@compiler.org>)
Responses Re: New "raw" COPY format
List pgsql-hackers
On Mon, Nov 4, 2024 at 7:22 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Mon, Nov 4, 2024, at 19:34, Masahiko Sawada wrote:
> > On Sat, Nov 2, 2024 at 4:08 AM Joel Jacobson <joel@compiler.org> wrote:
> >>
> >> On Fri, Nov 1, 2024, at 22:28, Masahiko Sawada wrote:
> >> > As I mentioned in a separate email, if we use the OS default EOL as
> >> > the default EOL in raw format, it would not be necessary to allow it
> >> > to be multi-character. I think it's worth considering.
> >>
> >> I like the idea, but not sure I understand how it would work.
> >>
> >> What if a user's OS default is \n (LF) and this user wants
> >> to import a Windows text file that uses \r\n (CR LF), which is a
> >> multi-character EOL delimiter?
> >>
> >> Was your idea to make an exception for that particular EOL,
> >> or to simply not support that edge case?
> >
> > IIUC the text and csv formats already support it. We start from the
> > EOL_UNKNOWN state and guess the EOL marker while parsing the line. I
> > think we can do something similar to what we do in the text and csv
> > formats but we won't need to care about quotes and escapes in the raw
> > format.
>
> Ah, OK, then I see what you mean.
>
> That's actually how the patch worked initially, but due to comments by
> Jacob Champion, the magic EOL detection was removed.
>
> I have no strong opinion, maybe it's fine, since that's how most
> text editors seem to work: they detect the EOL automatically.
>
> Maybe we should then also rename the format to SINGLE, like suggested by
> Jacob and Andrew, since it perhaps wouldn't be fair to say it's RAW when
> it does magic detection.
>
> Below is the relevant part of the discussion earlier in this thread.
>
> I'll await your comments on this before making any changes.

If I understand this feature correctly, users can fully benefit from
the raw format even without a multi-byte magic EOL. If a multi-byte
magic EOL turns out to improve the user experience further, we can
develop it later as an improvement. That way, we can keep the first
patch simple and small.

As an alternative idea, as with the text and CSV formats, I think we
could use the newline EOL ('\r' and/or '\n') as the default, and let
users specify either a single-byte EOL as the delimiter or none/empty
(i.e., loading the whole file into a single tuple).
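
To make the alternative concrete, here is a rough sketch of the
splitting behavior I have in mind. It is written in Python only for
brevity and is not code from the patch; the names detect_eol,
split_records and DEFAULT are hypothetical. It assumes the default
mode guesses the EOL from the first line terminator it sees (starting
from an "unknown" state), and it ignores details such as rejecting
mixed newline styles.

```python
from typing import List, Optional

# Hypothetical sentinel meaning "no EOL option given, use newline detection".
DEFAULT = object()


def detect_eol(data: bytes) -> Optional[bytes]:
    """Guess the EOL from the first terminator found (unknown -> LF, CR or CRLF)."""
    for i, byte in enumerate(data):
        if byte == 0x0A:  # '\n' seen first: LF
            return b"\n"
        if byte == 0x0D:  # '\r' seen first: CRLF if '\n' follows, otherwise CR
            return b"\r\n" if data[i + 1:i + 2] == b"\n" else b"\r"
    return None  # no terminator in the input at all


def split_records(data: bytes, eol=DEFAULT) -> List[bytes]:
    """Split raw input into records (one record per tuple).

    eol=DEFAULT    -> newline EOL, guessed from the data
    eol=b"x"       -> user-specified single-byte delimiter
    eol=None / b"" -> no delimiter: the whole input is one record
    """
    if eol is DEFAULT:
        eol = detect_eol(data)
    elif eol is not None and len(eol) > 1:
        # Assumption of this sketch: user-specified EOLs are single-byte only.
        raise ValueError("EOL must be a single byte")

    if not eol:
        return [data]

    records = data.split(eol)
    if records[-1] == b"":
        # A trailing EOL terminates the last record; it does not start a new one.
        records.pop()
    return records
```

With this shape, a Windows file such as b"a\r\nb\r\n" is handled by the
default mode without the user having to specify a multi-byte EOL, while
split_records(data, eol=None) returns the whole input as one record.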

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
