Hi,
In <CAKFQuwbhSssKTJyeYo9rn20zffV3L7wdQSbEQ8zwRfC=uXLkVA@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 31 Mar 2025 10:05:34 -0700,
"David G. Johnston" <david.g.johnston@gmail.com> wrote:
> The CopyFromInFunc API allows for each attribute to somehow
> have its I/O format individualized. But I don't see how that is practical
> or useful, and it adds burden on API users.
If an extension want to use I/O routines, it can use the
CopyFromInFunc API. Otherwise it can provide an empty
function.
For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
uses the CopyFromInFunc API but
https://github.com/kou/pg-copy-arrow/blob/main/copy_arrow.cc
uses an empty function for the CopyFromInFunc API.
The "it adds burden" means that "defining an empty function
is inconvenient", right? See also our past discussion for
this design:
https://www.postgresql.org/message-id/ZbijVn9_51mljMAG%40paquier.xyz
> Keeping empty options does not strike as a bad idea, because this
> forces extension developers to think about this code path rather than
> just ignore it.
> I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a
> property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text
> or Copy_IO_Binary and then just branch to either:
>
> CopyFromTextLikeInFunc & CopyFromTextLikeStart/End
> or
> CopyFromBinaryInFunc & CopyFromStart/End
>
> So, in effect, the only method an extension needs to write is converting
> to/from the 'serialized' form to the text/binary form (text being near
> unanimous).
I object this API. If we choose this API, we can create only
custom COPY formats that compatible with PostgreSQL's
text/binary form. For example, the above jsonlines format
and Apache Arrow format aren't implemented. It's meaningless
to introduce this custom COPY format mechanism with the
suggested API.
> It seems to me that CopyFromOneRow could simply produce a *string
> collection,
> one cell per attribute, and NextCopyFrom could do all of the above on a
> for-loop over *string
You suggest that we use a string collection instead of a
Datum collection in CopyFromOneRow() and convert a string
collection to a Datum collection in NextCopyFrom(), right?
I object this API. Because it has needless string <-> Datum
conversion overhead. For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
parses a JSON value to Datum. If we use this API, we need to
convert parsed Datum to string in an extension and
NextCopyFrom() re-converts the converted string to
Datum. It will slow down custom COPY format.
I want this custom COPY format feature for performance. So
APIs that require needless overhead for non text/csv/binary
formats isn't acceptable to me.
Thanks,
--
kou