Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id CAD21AoB82+MoP_RJ=zzhO9KaHK4LbfGjORkre34C7g-xsCdegQ@mail.gmail.com
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
On Thu, May 1, 2025 at 4:04 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 01, 2025 at 12:15:30PM -0700, Masahiko Sawada wrote:
> > In light of these concerns, I've been contemplating alternative
> > interface designs. One promising approach would involve registering
> > custom copy formats via a C function during module loading
> > (specifically, in _PG_init()). This method would require extension
> > authors to invoke a registration function, say
> > RegisterCustomCopyFormat(), in _PG_init() as follows:
> >
> > JsonLinesFormatId = RegisterCustomCopyFormat("jsonlines",
> >                                              &JsonLinesCopyToRoutine,
> >                                              &JsonLinesCopyFromRoutine);
> >
> > The registration function would validate the format name and store it
> > in TopMemoryContext. It would then return a unique identifier that can
> > be used subsequently to reference the custom copy format extension.
>
> Hmm.  How much should we care about the observability of the COPY
> format used by a given backend?  Storing this information in a
> backend's TopMemoryContext is OK to get the extensibility basics to
> work, but could it make sense to use some shmem state to allocate a
> uint32 ID that could be shared by all backends?  Contrary to EXPLAIN,
> COPY commands usually run for a very long time, so I am wondering if
> these APIs should be designed so that it would be possible to monitor
> the format used.  One layer where the format information could be made
> available is the progress reporting view for COPY, for example.  I can
> also imagine a pgstats kind where we do COPY stats aggregates, with a
> per-format pgstats kind, and sharing a fixed ID across multiple
> backends is relevant (when flushing the stats at shutdown, we would
> use a name/ID mapping like replication slots).
>
> I don't think that this needs to be relevant for the option part, just
> for the format where, I suspect, we should store in a shmem array,
> indexed by the allocated ID, the name of the format, the library of
> the callback and the function name fed to load_external_function().
>
> Note that custom LWLock and wait events use a shmem state for
> monitoring purposes, where we are able to do ID->format name lookups
> as much as format->ID lookups.  Perhaps it's OK not to do that for
> COPY, but I am wondering if we'd better design things from the start
> with the state kept in shmem, knowing that COPY is a long-running
> operation, and that anyone who mixes multiple formats would most
> likely want to know, through SQL, which formats are bottlenecks.
> Cloud providers would love that.

Good point. It would make sense to keep such information in a map in
shmem. dshash might be the better fit here, since a custom copy format
module can be loaded at runtime, after the fixed-size shared memory
has already been sized. Alternatively, we could use dynahash with a
large enough maximum number of entries.
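
To make the ID <-> name mapping a bit more concrete, here is a rough
standalone sketch in plain C. It is not PostgreSQL code and not a
proposed patch: the static array only stands in for a shmem area, and
every name in it (CopyFormatRegistry, copy_format_register(), the
capacity and length limits) is made up for illustration. A real version
would need a lock around the table (an LWLock, or dshash's own
partition locks), which the sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limits; a real patch would pick its own. */
#define MAX_COPY_FORMATS   64
#define FORMAT_NAME_LEN    64
#define INVALID_FORMAT_ID  0u

/* One slot per registered format: what a monitoring view would need. */
typedef struct CopyFormatEntry
{
    char name[FORMAT_NAME_LEN];     /* format name, e.g. "jsonlines" */
    char library[FORMAT_NAME_LEN];  /* library providing the callbacks */
} CopyFormatEntry;

typedef struct CopyFormatRegistry
{
    uint32_t        nused;          /* number of slots in use */
    CopyFormatEntry entries[MAX_COPY_FORMATS];
} CopyFormatRegistry;

/* Stand-in for a shared memory segment visible to all backends. */
static CopyFormatRegistry registry;

/* name -> ID lookup; returns INVALID_FORMAT_ID if not registered. */
uint32_t
copy_format_lookup_id(const char *name)
{
    if (name == NULL)
        return INVALID_FORMAT_ID;
    for (uint32_t i = 0; i < registry.nused; i++)
    {
        if (strcmp(registry.entries[i].name, name) == 0)
            return i + 1;           /* IDs start at 1, 0 means invalid */
    }
    return INVALID_FORMAT_ID;
}

/* ID -> name lookup; returns NULL for an unknown ID. */
const char *
copy_format_lookup_name(uint32_t id)
{
    if (id == INVALID_FORMAT_ID || id > registry.nused)
        return NULL;
    return registry.entries[id - 1].name;
}

/*
 * Register a format and return its ID.  Registering a name that is
 * already present returns the existing ID, so every backend loading
 * the same module ends up with the same value.
 */
uint32_t
copy_format_register(const char *name, const char *library)
{
    uint32_t         id;
    CopyFormatEntry *entry;

    assert(registry.nused <= MAX_COPY_FORMATS);

    /* Validate the inputs: non-empty name, both strings must fit. */
    if (name == NULL || library == NULL || name[0] == '\0')
        return INVALID_FORMAT_ID;
    if (strlen(name) >= FORMAT_NAME_LEN || strlen(library) >= FORMAT_NAME_LEN)
        return INVALID_FORMAT_ID;

    id = copy_format_lookup_id(name);
    if (id != INVALID_FORMAT_ID)
        return id;                  /* already registered */

    if (registry.nused == MAX_COPY_FORMATS)
        return INVALID_FORMAT_ID;   /* table is full */

    entry = &registry.entries[registry.nused++];
    snprintf(entry->name, sizeof(entry->name), "%s", name);
    snprintf(entry->library, sizeof(entry->library), "%s", library);
    return registry.nused;
}
```

With something like this, progress reporting could publish just the
uint32 ID and a view could translate it back to the name. The fixed
array has the same limitation as a dynahash sized at startup: the
capacity has to be chosen up front, which is the part dshash would
avoid.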

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


