Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id aBP91umMAJcXbhHc@paquier.xyz
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
On Thu, May 01, 2025 at 12:15:30PM -0700, Masahiko Sawada wrote:
> In light of these concerns, I've been contemplating alternative
> interface designs. One promising approach would involve registering
> custom copy formats via a C function during module loading
> (specifically, in _PG_init()). This method would require extension
> authors to invoke a registration function, say
> RegisterCustomCopyFormat(), in _PG_init() as follows:
>
> JsonLinesFormatId = RegisterCustomCopyFormat("jsonlines",
>                                              &JsonLinesCopyToRoutine,
>                                              &JsonLinesCopyFromRoutine);
>
> The registration function would validate the format name and store it
> in TopMemoryContext. It would then return a unique identifier that can
> be used subsequently to reference the custom copy format extension.

Hmm.  How much should we care about the observability of the COPY
format used by a given backend?  Storing this information in a
backend's TopMemoryContext is OK to get the extensibility basics to
work, but could it make sense to use some shmem state to allocate a
uint32 ID that could be shared by all backends.  Contrary to EXPLAIN,
COPY commands usually run for a very long time, so I am wondering if
these APIs should be designed so as it would be possible to monitor
the format used.  One layer where the format information could be made
available is the progress reporting view for COPY, for example.  I can
also imagine a pgstats kind where we do COPY stats aggregates, with a
per-format pgstats kind, and sharing a fixed ID across multiple
backends is relevant (when flushing the stats at shutdown, we would
use a name/ID mapping like replication slots).

I don't think that this needs to be relevant for the option part, just
for the format where, I suspect, we should store in a shmem array
based on the ID allocated the name of the format, the library of the
callback and the function name fed to load_external_function().

Note that custom LWLock and wait events use a shmem state for
monitoring purposes, where we are able to do ID->format name lookups
as much as format->ID lookups.  Perhaps it's OK not to do that for
COPY, but I am wondering if we'd better design things from scratch
with states in shmem state knowing that COPY is a long-running
operation, and that if one mixes multiple formats they would most
likely want to know which formats are bottlenecks, through SQL.  Cloud
providers would love that.

> This approach offers several advantages: it would eliminate the
> search_path issue, provide greater flexibility, and potentially
> simplify the overall interface for users and developers alike. We
> might be able to provide a view showing the registered custom COPY
> format in the future. Also, these interfaces align with other
> customizable functionalities such as custom rmgr, custom lwlock,
> custom waitevent, and custom EXPLAIN option etc.

Yeah, agreed with the search_path concerns.  We are getting better at
making areas of Postgres more pluggable lately, having a loading path
where we don't have any of these potential issues by design matters.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: index prefetching
Next
From: Fujii Masao
Date:
Subject: Re: pgsql: Add function to log the memory contexts of specified backend pro