Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id 257d5573-07da-48c3-ac07-e047e7a65e99@enterprisedb.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Sutou Kouhei <kou@clear-code.com>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
Hello Kouhei-san,

I think it'd be helpful if you could post a patch status, i.e. a message
re-explaininig what it aims to achieve, summary of the discussion so
far, and what you think are the open questions. Otherwise every reviewer
has to read the whole thread to learn this.

FWIW I realize there are other related patches, and maybe some of the
discussion is happening on those threads. But that's just another reason
to post the summary here - as a reviewer I'm not going to read random
other patches that "might" have relevant info.

-----

The way I understand it, the ultimate goal is to allow extensions to
define formats using CREATE XYZ. And I agree that would be a very
valuable feature. But the proposed patch does not do that, right? It
only does some basic things at the C level, there's no DDL etc.

Per the commit message, none of the existing formats (text/csv/binary)
is implemented as "copy routine". IMHO that's a bit strange, because
that's exactly what I'd expect this patch to do - to define all the
infrastructure (catalogs, ...) and switch the existing formats to it.

Yes, the patch will be larger, but it'll also simplify some of the code
(right now there's a bunch of branches to handle these "old" formats).
How would you even know the new code is correct, when there's nothing
using using the "copy routine" branch?

In fact, doesn't this mean that the benchmarks presented earlier are not
very useful? We still use the old code, except there are a couple "if"
branches that are never taken? I don't think this measures the new
approach would not be slower once everything gets to be copy routine.

Or what am I missing?

Also, how do we know this API is suitable for the alternative formats?
For example you mentioned Arrow, and I suppose people will want to add
support for other column-oriented formats. I assume that will require
stashing a batch of rows (or some other internal state) somewhere, but
does the proposed API plan for that?

My guess would be we'll need to add a "private_data" pointer to the
CopyFromStateData/CopyToStateData structs, but maybe I'm wrong.

Also, won't the alternative formats require custom parameters. For
example, for column-oriented-formats it might be useful to specify a
stripe size (rows per batch), etc. I'm not saying this patch needs to
implement that, but maybe the API should expect it?

-----

To sum this up, what I think needs to happen for this patch to move forward:

1) Switch the existing formats to the new API, to validate the API works
at least for them, allow testing and benchmarking the code.

2) Try implementing some of the more exotic formats (column-oriented) to
test the API works for those too.

3) Maybe try implementing a PoC version to do the DDL, so that it
actually is extensible.

It's not my intent to "move the goalposts" - I think it's fine if the
patches (2) and (3) are just PoC, to validate (1) goes in the right
direction. For example, it's fine if (2) just hard-codes the new format
next to the build-in ones - that's not something we'd commit, I think,
but for validation of (1) it's good enough.

Most of the DDL stuff can probably be "copied" from FDW handlers. It's
pretty similar, and the "routine" idea is what FDW does too. It probably
also shows a good way to "initialize" the routine, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: make pg_ctl more friendly
Next
From: Amit Kapila
Date:
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR