Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date | |
Msg-id | CAD21AoB82+MoP_RJ=zzhO9KaHK4LbfGjORkre34C7g-xsCdegQ@mail.gmail.com |
In response to | Re: Make COPY format extendable: Extract COPY TO format implementations (Michael Paquier <michael@paquier.xyz>) |
Responses | Re: Make COPY format extendable: Extract COPY TO format implementations |
List | pgsql-hackers |
On Thu, May 1, 2025 at 4:04 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 01, 2025 at 12:15:30PM -0700, Masahiko Sawada wrote:
> > In light of these concerns, I've been contemplating alternative
> > interface designs. One promising approach would involve registering
> > custom copy formats via a C function during module loading
> > (specifically, in _PG_init()). This method would require extension
> > authors to invoke a registration function, say
> > RegisterCustomCopyFormat(), in _PG_init() as follows:
> >
> > JsonLinesFormatId = RegisterCustomCopyFormat("jsonlines",
> >                                              &JsonLinesCopyToRoutine,
> >                                              &JsonLinesCopyFromRoutine);
> >
> > The registration function would validate the format name and store it
> > in TopMemoryContext. It would then return a unique identifier that can
> > be used subsequently to reference the custom copy format extension.
>
> Hmm. How much should we care about the observability of the COPY
> format used by a given backend? Storing this information in a
> backend's TopMemoryContext is OK to get the extensibility basics to
> work, but could it make sense to use some shmem state to allocate a
> uint32 ID that could be shared by all backends. Contrary to EXPLAIN,
> COPY commands usually run for a very long time, so I am wondering if
> these APIs should be designed so as it would be possible to monitor
> the format used. One layer where the format information could be made
> available is the progress reporting view for COPY, for example. I can
> also imagine a pgstats kind where we do COPY stats aggregates, with a
> per-format pgstats kind, and sharing a fixed ID across multiple
> backends is relevant (when flushing the stats at shutdown, we would
> use a name/ID mapping like replication slots).
>
> I don't think that this needs to be relevant for the option part, just
> for the format where, I suspect, we should store in a shmem array
> based on the ID allocated the name of the format, the library of the
> callback and the function name fed to load_external_function().
>
> Note that custom LWLock and wait events use a shmem state for
> monitoring purposes, where we are able to do ID->format name lookups
> as much as format->ID lookups. Perhaps it's OK not to do that for
> COPY, but I am wondering if we'd better design things from scratch
> with states in shmem state knowing that COPY is a long-running
> operation, and that if one mixes multiple formats they would most
> likely want to know which formats are bottlenecks, through SQL. Cloud
> providers would love that.

Good point. It would make sense to have such information as a map on
shmem. It might be better to use dshash here since a custom copy
format module can be loaded at runtime. Or we can use dynahash with
large enough elements.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
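To make the idea of a shared name/ID map more concrete, here is a minimal standalone C sketch of a fixed-size registry that supports both ID->name and name->ID lookups. It is only an illustration under stated assumptions, not the proposed patch: the names `CopyFormatRegistry`, `register_copy_format()`, `copy_format_name()` and `copy_format_id()` are hypothetical, a static struct stands in for the shared memory segment, and no locking is shown (a real shmem version would need an LWLock, and might use dshash or dynahash instead of a linear array).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_COPY_FORMATS    64	/* hypothetical fixed capacity */
#define COPY_FORMAT_NAMELEN 64	/* hypothetical fixed-width name buffers */

/* One slot per format: its name, the library and the callback function name. */
typedef struct CopyFormatEntry
{
	char		name[COPY_FORMAT_NAMELEN];
	char		library[COPY_FORMAT_NAMELEN];
	char		function[COPY_FORMAT_NAMELEN];
} CopyFormatEntry;

typedef struct CopyFormatRegistry
{
	uint32_t	nformats;
	CopyFormatEntry entries[MAX_COPY_FORMATS];
} CopyFormatRegistry;

/* Stand-in for the shared memory area; zero-initialized, so it starts empty. */
static CopyFormatRegistry registry;

/* A string is acceptable if it is non-empty and fits in a fixed buffer. */
static int
fits(const char *s)
{
	return s != NULL && s[0] != '\0' && strlen(s) < COPY_FORMAT_NAMELEN;
}

/* name -> ID lookup; returns 0 (the invalid ID) if the name is unknown. */
uint32_t
copy_format_id(const char *name)
{
	if (name == NULL)
		return 0;
	for (uint32_t i = 0; i < registry.nformats; i++)
	{
		if (strcmp(registry.entries[i].name, name) == 0)
			return i + 1;
	}
	return 0;
}

/* ID -> name lookup; returns NULL if the ID was never allocated. */
const char *
copy_format_name(uint32_t id)
{
	if (id == 0 || id > registry.nformats)
		return NULL;
	return registry.entries[id - 1].name;
}

/*
 * Register a format and return its ID (1-based).  Registering a name that
 * already exists returns the existing ID, so every caller sees the same
 * mapping.  Returns 0 on invalid input or when the array is full.
 */
uint32_t
register_copy_format(const char *name, const char *library,
					 const char *function)
{
	uint32_t	id;
	CopyFormatEntry *entry;

	if (!fits(name) || !fits(library) || !fits(function))
		return 0;

	id = copy_format_id(name);
	if (id != 0)
		return id;

	if (registry.nformats >= MAX_COPY_FORMATS)
		return 0;

	/* strcpy is safe here because fits() bounded every length above. */
	entry = &registry.entries[registry.nformats];
	strcpy(entry->name, name);
	strcpy(entry->library, library);
	strcpy(entry->function, function);
	registry.nformats++;
	assert(registry.nformats <= MAX_COPY_FORMATS);

	return registry.nformats;
}
```

In this sketch, a module would call `register_copy_format("jsonlines", ...)` once at load time and keep the returned ID, while monitoring code could map an ID back to a name with `copy_format_name()`. The fixed-size array corresponds to the "dynahash with large enough elements" trade-off: it is simple but caps the number of formats, whereas a dshash-based map could grow as modules are loaded at runtime.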