Re: New "single" COPY format - Mailing list pgsql-hackers

From Joel Jacobson
Subject Re: New "single" COPY format
Date
Msg-id 55dee2a2-31e9-4be4-a095-0a89031a9036@app.fastmail.com
Whole thread Raw
In response to Re: New "single" COPY format  (Aleksander Alekseev <aleksander@timescale.com>)
List pgsql-hackers
On Fri, Nov 8, 2024, at 12:25, Aleksander Alekseev wrote:
> Sorry for being late for the discussion.

No worries, better late than never, thanks for chiming in.

> I disagree with the idea of adding a new format name for this. Mostly
> because it is *not* a new format and pretending that it is will be
> just a poor and/or confusing user interface.
>
> IMO it should be 'text' we already have with special options e.g.
> DELIMITER AS NULL ESCAPE AS NULL. If there are no escape characters
> and column delimiters (and no NULLs designations, and what else I
> forgot) then your text file just contains one tuple per line.
>
> Personally I wouldn't mind a special syntax such as LINES AS IS or
> maybe COPY AS IS for convenience. Perhaps we should discuss it
> separately though as a syntax sugar for a long list of options we
> already support.

From an implementation perspective, I agree with you that this could
be handled by tweaking the existing code for the 'text' and 'csv' formats,
although with a performance penalty for the existing formats.

But from a user-perspective, the implementation is of course irrelevant,
then what I think is important, is that the format should have an intuitive name,
where the default behaviour should match a typical file in the format,
as closely as possible.

For this reason, the 'text' format is unfortunately a poor name,
since it gives the impression it's a generic format for text files,
which it's not, it's a PostgreSQL-specific format, where "\." on a
single line has special meaning, and other defaults such as \N
are also PostgreSQL-specific, and needs to be overriden, if dealing
with a non-PostgreSQL specific text file.
Users who fail to understand these details, risks being surprised.

In contrast, the 'csv' format, works quite as expected.

So for this reason, I think a new format, is a good idea, not only
because it makes it much clearer how to have a fast parsing path
in the implementation, but also because it will increase the chances
users will get things right, when dealing with non-PostgreSQL specific
text files, such as JSONL and log files.

Sure, from an implementation perspective, we could have separate
specialized functions, to allow for fast parsing paths, even if
just overloading the existing options, but that would be a bit awkward I think.

The "DELIMITER AS NULL ESCAPE AS NULL" idea was proposed in the old thread
"Should CSV parsing be stricter about mid-field quotes?" [1]

[1] https://postgr.es/m/8aeab305-5e94-4fa5-82bf-6da6baee6e05@app.fastmail.com

However, some of us came to the conclusion that it would be
better to introduce a new format, for reasons explained below,
quoted from the old thread [1]:

On Wed, Oct 9, 2024, at 18:14, Andrew Dunstan wrote:
> On 2024-10-09 We 11:58 AM, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> I think it would be nicest to introduce a new "raw" FORMAT,
>>> that clearly get things right already at the top-level,
>>> instead of trying to customize any of the existing formats.
>>
>> FWIW, I like this idea.  It makes it much clearer how to have a
>> fast parsing path.
>
> WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}? We don't seem
> to have an existing keyword that we could sanely reuse here.

To add to that, I think there is value of a new format, from a user-friendiness
perspective, by keeping the existing formats and their options intact,
and instead just add a new format, with a clear name, with well-defined
semantics, explained in the docs under its own section, to avoid cluttering
the documentation further, where users would need to assemble various options,
and understand their intricate details, in order to get things right.

/Joel



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY
Next
From: Alvaro Herrera
Date:
Subject: Re: not null constraints, again