Hi,

In <CAD21AoB82+MoP_RJ=zzhO9KaHK4LbfGjORkre34C7g-xsCdegQ@mail.gmail.com>
"Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 2 May 2025 15:52:49 -0700,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>> Hmm. How much should we care about the observability of the COPY
>> format used by a given backend? Storing this information in a
>> backend's TopMemoryContext is OK to get the extensibility basics to
>> work, but could it make sense to use some shmem state to allocate a
>> uint32 ID that could be shared by all backends. Contrary to EXPLAIN,
>> COPY commands usually run for a very long time, so I am wondering if
>> these APIs should be designed so as it would be possible to monitor
>> the format used. One layer where the format information could be made
>> available is the progress reporting view for COPY, for example. I can
>> also imagine a pgstats kind where we do COPY stats aggregates, with a
>> per-format pgstats kind, and sharing a fixed ID across multiple
>> backends is relevant (when flushing the stats at shutdown, we would
>> use a name/ID mapping like replication slots).
>>
>> I don't think that this needs to be relevant for the option part, just
>> for the format where, I suspect, we should store in a shmem array
>> based on the ID allocated the name of the format, the library of the
>> callback and the function name fed to load_external_function().
>>
>> Note that custom LWLock and wait events use a shmem state for
>> monitoring purposes, where we are able to do ID->format name lookups
>> as much as format->ID lookups. Perhaps it's OK not to do that for
>> COPY, but I am wondering if we'd better design things from scratch
>> with states in shmem state knowing that COPY is a long-running
>> operation, and that if one mixes multiple formats they would most
>> likely want to know which formats are bottlenecks, through SQL. Cloud
>> providers would love that.
>
> Good point. It would make sense to have such information as a map on
> shmem. It might be better to use dshash here since a custom copy
> format module can be loaded at runtime. Or we can use dynahash with
> large enough elements.

If we don't need to assign an ID to each format, can we
avoid doing so? If we do implement it, would this approach
be more complex than the current approach, which resembles
how table sampling methods are handled?
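
To make the complexity question concrete, here is a minimal
standalone sketch of what an ID-based registry might involve.
It is only an illustration, not PostgreSQL code: all names
(CopyFormatRegistry, copy_format_register() and so on) are
hypothetical, a static array stands in for shared memory, and
it omits the locking, the library/function name fields and the
dshash or dynahash storage suggested above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names and limits below are hypothetical, for illustration only. */
#define MAX_COPY_FORMATS  64   /* fixed capacity of the registry */
#define FORMAT_NAME_LEN   64   /* includes the trailing NUL */
#define INVALID_FORMAT_ID 0    /* IDs start at 1 so that 0 means "none" */

typedef struct CopyFormatEntry
{
    char        name[FORMAT_NAME_LEN];
} CopyFormatEntry;

typedef struct CopyFormatRegistry
{
    uint32_t        nformats;
    CopyFormatEntry entries[MAX_COPY_FORMATS];
} CopyFormatRegistry;

/* Stands in for a shared memory segment; zero-initialized. */
static CopyFormatRegistry registry;

/* format name -> ID; returns INVALID_FORMAT_ID if not registered */
uint32_t
copy_format_lookup_id(const char *name)
{
    for (uint32_t i = 0; i < registry.nformats; i++)
    {
        if (strcmp(registry.entries[i].name, name) == 0)
            return i + 1;
    }
    return INVALID_FORMAT_ID;
}

/*
 * Register a format and return its ID. Registering the same name
 * twice returns the same ID. Returns INVALID_FORMAT_ID when the
 * registry is full or the name is too long.
 */
uint32_t
copy_format_register(const char *name)
{
    uint32_t    id = copy_format_lookup_id(name);

    if (id != INVALID_FORMAT_ID)
        return id;
    if (registry.nformats >= MAX_COPY_FORMATS ||
        strlen(name) >= FORMAT_NAME_LEN)
        return INVALID_FORMAT_ID;

    strcpy(registry.entries[registry.nformats].name, name);
    registry.nformats++;
    return registry.nformats;
}

/* ID -> format name; returns NULL for an unknown ID */
const char *
copy_format_lookup_name(uint32_t id)
{
    if (id == INVALID_FORMAT_ID || id > registry.nformats)
        return NULL;
    return registry.entries[id - 1].name;
}
```

Even this simplified version needs a capacity limit and
name-to-ID and ID-to-name lookups, and a real shared memory
version would additionally need locking around registration.
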
Thanks,
--
kou