Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id aEo8kMGcSi0ZxAYn@paquier.xyz
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
On Mon, May 26, 2025 at 10:04:05AM +0900, Sutou Kouhei wrote:
> As I already said, I don't have a strong opinion on which
> approach is better. My opinion for the (important) second
> point is no. I feel that the pros of a. isn't realistic. If
> users want to improve text/csv/binary performance (or
> something), they should improve PostgreSQL itself instead of
> replacing it as an extension. (Or they should create another
> custom copy format such as "faster_text" not "text".)

Patches welcome.  Andres may have a TODO board regarding that, I
think.

> So I'm OK with the approach b.

Here is an opinion.

Approach (b), that uses _PG_init() a function to register a custom
format has the merit to be simple to implement and secure by "design",
because it depends only on the fact that we can do a lookup based on
the string defined in one or more DefElems.  Adding a dependendy to
search_path as you say could lead to surprising results.

Using a shared ID when a COPY method is registered (like extension
wait events) or an ID that's static in a backend (like EXPLAIN
extensibility does) is an implementation difference that can be useful
for monitoring, and only that AFAIK.  If you want to implement
method-based statistics for COPY, you will want to allocate one stats
kind for each COPY method, because the stats stored will be aggregates
of the COPY methods.  The stats kind ID is something that should not
be linked to the COPY method ID, because the stats kind ID is
registered in its own dedicated path, and it would be hardcoded in the
library where the COPY callbacks are defined.  So you could have a
stats kind with a fixed ID, and a COPY method ID that's linked to each
backend like EXPLAIN does.

One factor to take into account is how much freedom we are OK with
giving to users when it comes to the deployment of custom COPY
methods, and how popular these would be.  Cloud is popular these days,
so folks may want to be able to define pointers to functions that are
run in something else than C, as long as the language is trusted.  My
take on this part is that we are not going to see many formats out
there that would benefit from these callbacks, so asking for people to
deploy a .so on disk that can only be LOAD'ed or registered with one
of the preloading GUCs should be enough to satisfy most users, even if
the barrier entry to get that only a cloud instead like RDS or Azure
is higher.  This has also the benefit in giving more control on the
COPY internals to cloud providers, as they are the ones who would be
in charge of saying if they're OK with a dedicated .so or not.  Not
the users themselves.  We've had a lot of bad PR and false CVEs in the
past with COPY FROM/TO PROGRAM and the fact that it requires
superusers.  Having something in this area that gives more freedom to
the user with something like approach (a) (SQL functions allowed to
define the callback) will, I suspect, bite us back hard.

So, my opinion is to rely on _PG_init(), with a shared ID if you want
to expose the method used somewhere for monitoring tools.  You could
as well implement the simpler set of APIs that allocates IDs local to
each backend, like EXPLAIN, then consider later if shared IDs are
really needed.  The registration APIs don't have to be fixed in time
across releases, they can be always improved in steps as required.
What matters is ABI compatibility in the same major version once it is
released.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: "Mark Drake"
Date:
Subject: RE: Inconsistent Behavior in JSONB Numeric Array Deletion
Next
From: Amit Kapila
Date:
Subject: Re: failover logical replication slots