Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date | |
Msg-id | 257d5573-07da-48c3-ac07-e047e7a65e99@enterprisedb.com Whole thread Raw |
In response to | Re: Make COPY format extendable: Extract COPY TO format implementations (Sutou Kouhei <kou@clear-code.com>) |
Responses |
Re: Make COPY format extendable: Extract COPY TO format implementations
Re: Make COPY format extendable: Extract COPY TO format implementations |
List | pgsql-hackers |
Hello Kouhei-san, I think it'd be helpful if you could post a patch status, i.e. a message re-explaininig what it aims to achieve, summary of the discussion so far, and what you think are the open questions. Otherwise every reviewer has to read the whole thread to learn this. FWIW I realize there are other related patches, and maybe some of the discussion is happening on those threads. But that's just another reason to post the summary here - as a reviewer I'm not going to read random other patches that "might" have relevant info. ----- The way I understand it, the ultimate goal is to allow extensions to define formats using CREATE XYZ. And I agree that would be a very valuable feature. But the proposed patch does not do that, right? It only does some basic things at the C level, there's no DDL etc. Per the commit message, none of the existing formats (text/csv/binary) is implemented as "copy routine". IMHO that's a bit strange, because that's exactly what I'd expect this patch to do - to define all the infrastructure (catalogs, ...) and switch the existing formats to it. Yes, the patch will be larger, but it'll also simplify some of the code (right now there's a bunch of branches to handle these "old" formats). How would you even know the new code is correct, when there's nothing using using the "copy routine" branch? In fact, doesn't this mean that the benchmarks presented earlier are not very useful? We still use the old code, except there are a couple "if" branches that are never taken? I don't think this measures the new approach would not be slower once everything gets to be copy routine. Or what am I missing? Also, how do we know this API is suitable for the alternative formats? For example you mentioned Arrow, and I suppose people will want to add support for other column-oriented formats. I assume that will require stashing a batch of rows (or some other internal state) somewhere, but does the proposed API plan for that? My guess would be we'll need to add a "private_data" pointer to the CopyFromStateData/CopyToStateData structs, but maybe I'm wrong. Also, won't the alternative formats require custom parameters. For example, for column-oriented-formats it might be useful to specify a stripe size (rows per batch), etc. I'm not saying this patch needs to implement that, but maybe the API should expect it? ----- To sum this up, what I think needs to happen for this patch to move forward: 1) Switch the existing formats to the new API, to validate the API works at least for them, allow testing and benchmarking the code. 2) Try implementing some of the more exotic formats (column-oriented) to test the API works for those too. 3) Maybe try implementing a PoC version to do the DDL, so that it actually is extensible. It's not my intent to "move the goalposts" - I think it's fine if the patches (2) and (3) are just PoC, to validate (1) goes in the right direction. For example, it's fine if (2) just hard-codes the new format next to the build-in ones - that's not something we'd commit, I think, but for validation of (1) it's good enough. Most of the DDL stuff can probably be "copied" from FDW handlers. It's pretty similar, and the "routine" idea is what FDW does too. It probably also shows a good way to "initialize" the routine, etc. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: