Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id 20250404.154248.1031254461649883039.kou@clear-code.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-hackers
Hi,

In <CAKFQuwbhSssKTJyeYo9rn20zffV3L7wdQSbEQ8zwRfC=uXLkVA@mail.gmail.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on Mon, 31 Mar 2025 10:05:34 -0700,
  "David G. Johnston" <david.g.johnston@gmail.com> wrote:

>              The CopyFromInFunc API allows for each attribute to somehow
> have its I/O format individualized.  But I don't see how that is practical
> or useful, and it adds burden on API users.

If an extension want to use I/O routines, it can use the
CopyFromInFunc API. Otherwise it can provide an empty
function.

For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
uses the CopyFromInFunc API but
https://github.com/kou/pg-copy-arrow/blob/main/copy_arrow.cc
uses an empty function for the CopyFromInFunc API.

The "it adds burden" means that "defining an empty function
is inconvenient", right? See also our past discussion for
this design:

https://www.postgresql.org/message-id/ZbijVn9_51mljMAG%40paquier.xyz

> Keeping empty options does not strike as a bad idea, because this
> forces extension developers to think about this code path rather than
> just ignore it.


> I suggest we remove both .CopyFromInFunc and .CopyFromStart/End and add a
> property to CopyFromRoutine (.ioMode?) with values of either Copy_IO_Text
> or Copy_IO_Binary and then just branch to either:
> 
> CopyFromTextLikeInFunc & CopyFromTextLikeStart/End
> or
> CopyFromBinaryInFunc & CopyFromStart/End
> 
> So, in effect, the only method an extension needs to write is converting
> to/from the 'serialized' form to the text/binary form (text being near
> unanimous).

I object this API. If we choose this API, we can create only
custom COPY formats that compatible with PostgreSQL's
text/binary form. For example, the above jsonlines format
and Apache Arrow format aren't implemented. It's meaningless
to introduce this custom COPY format mechanism with the
suggested API.

> It seems to me that CopyFromOneRow could simply produce a *string
> collection,
> one cell per attribute, and NextCopyFrom could do all of the above on a
> for-loop over *string

You suggest that we use a string collection instead of a
Datum collection in CopyFromOneRow() and convert a string
collection to a Datum collection in NextCopyFrom(), right?

I object this API. Because it has needless string <-> Datum
conversion overhead. For example,
https://github.com/MasahikoSawada/pg_copy_jsonlines/blob/master/copy_jsonlines.c
parses a JSON value to Datum. If we use this API, we need to
convert parsed Datum to string in an extension and
NextCopyFrom() re-converts the converted string to
Datum. It will slow down custom COPY format.

I want this custom COPY format feature for performance. So
APIs that require needless overhead for non text/csv/binary
formats isn't acceptable to me.


Thanks,
-- 
kou



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Conflict detection for multiple_unique_conflicts in logical replication
Next
From: Bertrand Drouvot
Date:
Subject: Re: Draft for basic NUMA observability