Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id CAD21AoBwxgfkMYxgPWyrLG-r8-ptVKjd=jhncY_QAaVJYhQQdw@mail.gmail.com
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Wed, Jun 11, 2025 at 7:34 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, May 26, 2025 at 10:04:05AM +0900, Sutou Kouhei wrote:
> > As I already said, I don't have a strong opinion on which
> > approach is better. My opinion for the (important) second
> > point is no. I feel that the pros of a. aren't realistic. If
> > users want to improve text/csv/binary performance (or
> > something), they should improve PostgreSQL itself instead of
> > replacing it as an extension. (Or they should create another
> > custom copy format such as "faster_text" not "text".)
>
> Patches welcome.  Andres may have a TODO board regarding that, I
> think.
>
> > So I'm OK with the approach b.
>
> Here is an opinion.

Thank you for the comments.

>
> Approach (b), which uses _PG_init() to call a function that registers a
> custom format, has the merit of being simple to implement and secure by
> "design", because it depends only on the fact that we can do a lookup
> based on the string defined in one or more DefElems.  Adding a dependency
> on search_path, as you say, could lead to surprising results.
>
> Using a shared ID when a COPY method is registered (like extension
> wait events) or an ID that's static in a backend (like EXPLAIN
> extensibility does) is an implementation difference that can be useful
> for monitoring, and only that AFAIK.  If you want to implement
> method-based statistics for COPY, you will want to allocate one stats
> kind for each COPY method, because the stats stored will be aggregates
> of the COPY methods.  The stats kind ID is something that should not
> be linked to the COPY method ID, because the stats kind ID is
> registered in its own dedicated path, and it would be hardcoded in the
> library where the COPY callbacks are defined.  So you could have a
> stats kind with a fixed ID, and a COPY method ID that's linked to each
> backend like EXPLAIN does.

Good point.

>
> One factor to take into account is how much freedom we are OK with
> giving to users when it comes to the deployment of custom COPY
> methods, and how popular these would be.  Cloud is popular these days,
> so folks may want to be able to define pointers to functions that are
> run in something other than C, as long as the language is trusted.  My
> take on this part is that we are not going to see many formats out
> there that would benefit from these callbacks, so asking people to
> deploy a .so on disk that can only be LOAD'ed or registered with one
> of the preloading GUCs should be enough to satisfy most users, even if
> the barrier to entry for getting that onto a cloud instance like RDS or
> Azure is higher.  This also has the benefit of giving cloud providers
> more control over the COPY internals, as they are the ones who would be
> in charge of saying if they're OK with a dedicated .so or not.  Not
> the users themselves.  We've had a lot of bad PR and false CVEs in the
> past with COPY FROM/TO PROGRAM and the fact that it requires
> superusers.  Having something in this area that gives more freedom to
> the user with something like approach (a) (SQL functions allowed to
> define the callback) will, I suspect, bite us back hard.

That's a valid point and I agree.

>
> So, my opinion is to rely on _PG_init(), with a shared ID if you want
> to expose the method used somewhere for monitoring tools.  You could
> as well implement the simpler set of APIs that allocates IDs local to
> each backend, like EXPLAIN, then consider later if shared IDs are
> really needed.  The registration APIs don't have to stay fixed across
> releases; they can always be improved in steps as required.
> What matters is ABI compatibility in the same major version once it is
> released.

+1 to start with a simpler set of APIs.
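
For illustration only, here is a minimal standalone sketch of what
"IDs local to each backend, looked up by format name" could look like.
Every name in it (the demo_* functions, DemoCopyToRoutine, the
"demo_jsonl" format) is hypothetical and is not the API from the patch
or from PostgreSQL; demo_module_init() merely stands in for what an
extension would do from _PG_init().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback table; NOT the struct used by the actual patch. */
typedef struct DemoCopyToRoutine
{
    void (*start) (void *state);
    void (*one_row) (void *state);
    void (*end) (void *state);
} DemoCopyToRoutine;

#define DEMO_MAX_FORMATS 16

typedef struct DemoFormatEntry
{
    const char *name;           /* caller must keep this string alive */
    const DemoCopyToRoutine *routine;
} DemoFormatEntry;

/* Per-process state: in a real server this would be local to a backend. */
static DemoFormatEntry demo_formats[DEMO_MAX_FORMATS];
static int demo_nformats = 0;

/* Find a format by the string given in the COPY options; -1 if unknown. */
int
demo_lookup_copy_format(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < demo_nformats; i++)
    {
        if (strcmp(demo_formats[i].name, name) == 0)
            return i;
    }
    return -1;
}

/*
 * Register a format and return its backend-local ID (the array index).
 * Returns -1 if the name is already taken or the table is full.
 */
int
demo_register_copy_format(const char *name, const DemoCopyToRoutine *routine)
{
    assert(name != NULL && routine != NULL);
    if (demo_lookup_copy_format(name) >= 0)
        return -1;
    if (demo_nformats >= DEMO_MAX_FORMATS)
        return -1;
    demo_formats[demo_nformats].name = name;
    demo_formats[demo_nformats].routine = routine;
    return demo_nformats++;
}

/* Placeholder callbacks for the hypothetical "demo_jsonl" format. */
static void
demo_noop(void *state)
{
    (void) state;
}

const DemoCopyToRoutine demo_jsonl_routine = {demo_noop, demo_noop, demo_noop};

/* Stand-in for an extension's _PG_init(): register once at load time. */
void
demo_module_init(void)
{
    (void) demo_register_copy_format("demo_jsonl", &demo_jsonl_routine);
}
```

The lookup depends only on the format string, so nothing here consults
search_path. A fixed stats kind ID, if one were wanted later, would live
separately from the ID returned by the registration call.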

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


