Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date | |
Msg-id | aEo8kMGcSi0ZxAYn@paquier.xyz Whole thread Raw |
In response to | Re: Make COPY format extendable: Extract COPY TO format implementations ("David G. Johnston" <david.g.johnston@gmail.com>) |
Responses |
Re: Make COPY format extendable: Extract COPY TO format implementations
|
List | pgsql-hackers |
On Mon, May 26, 2025 at 10:04:05AM +0900, Sutou Kouhei wrote: > As I already said, I don't have a strong opinion on which > approach is better. My opinion for the (important) second > point is no. I feel that the pros of a. isn't realistic. If > users want to improve text/csv/binary performance (or > something), they should improve PostgreSQL itself instead of > replacing it as an extension. (Or they should create another > custom copy format such as "faster_text" not "text".) Patches welcome. Andres may have a TODO board regarding that, I think. > So I'm OK with the approach b. Here is an opinion. Approach (b), that uses _PG_init() a function to register a custom format has the merit to be simple to implement and secure by "design", because it depends only on the fact that we can do a lookup based on the string defined in one or more DefElems. Adding a dependendy to search_path as you say could lead to surprising results. Using a shared ID when a COPY method is registered (like extension wait events) or an ID that's static in a backend (like EXPLAIN extensibility does) is an implementation difference that can be useful for monitoring, and only that AFAIK. If you want to implement method-based statistics for COPY, you will want to allocate one stats kind for each COPY method, because the stats stored will be aggregates of the COPY methods. The stats kind ID is something that should not be linked to the COPY method ID, because the stats kind ID is registered in its own dedicated path, and it would be hardcoded in the library where the COPY callbacks are defined. So you could have a stats kind with a fixed ID, and a COPY method ID that's linked to each backend like EXPLAIN does. One factor to take into account is how much freedom we are OK with giving to users when it comes to the deployment of custom COPY methods, and how popular these would be. Cloud is popular these days, so folks may want to be able to define pointers to functions that are run in something else than C, as long as the language is trusted. My take on this part is that we are not going to see many formats out there that would benefit from these callbacks, so asking for people to deploy a .so on disk that can only be LOAD'ed or registered with one of the preloading GUCs should be enough to satisfy most users, even if the barrier entry to get that only a cloud instead like RDS or Azure is higher. This has also the benefit in giving more control on the COPY internals to cloud providers, as they are the ones who would be in charge of saying if they're OK with a dedicated .so or not. Not the users themselves. We've had a lot of bad PR and false CVEs in the past with COPY FROM/TO PROGRAM and the fact that it requires superusers. Having something in this area that gives more freedom to the user with something like approach (a) (SQL functions allowed to define the callback) will, I suspect, bite us back hard. So, my opinion is to rely on _PG_init(), with a shared ID if you want to expose the method used somewhere for monitoring tools. You could as well implement the simpler set of APIs that allocates IDs local to each backend, like EXPLAIN, then consider later if shared IDs are really needed. The registration APIs don't have to be fixed in time across releases, they can be always improved in steps as required. What matters is ABI compatibility in the same major version once it is released. -- Michael
Attachment
pgsql-hackers by date: