Home > mailing lists
Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From	Sutou Kouhei
Subject	Re: Make COPY format extendable: Extract COPY TO format implementations
Date	July 22, 2024 07:45:40
Msg-id	20240722.164540.889091645042390373.kou@clear-code.com Whole thread Raw
In response to	Re: Make COPY format extendable: Extract COPY TO format implementations (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses	Re: Make COPY format extendable: Extract COPY TO format implementations
List	pgsql-hackers
Tree view
Hi Tomas,

Thanks for joining this thread!

In <257d5573-07da-48c3-ac07-e047e7a65e99@enterprisedb.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on Fri, 19 Jul 2024 14:40:05 +0200,
  Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> I think it'd be helpful if you could post a patch status, i.e. a message
> re-explaininig what it aims to achieve, summary of the discussion so
> far, and what you think are the open questions. Otherwise every reviewer
> has to read the whole thread to learn this.

It makes sense. It seems your questions covers all important
points in this thread. So my answers of your questions
summarize the latest information.

> FWIW I realize there are other related patches, and maybe some of the
> discussion is happening on those threads. But that's just another reason
> to post the summary here - as a reviewer I'm not going to read random
> other patches that "might" have relevant info.

It makes sense too. To clarify it, other threads are
unrelated. We can focus on only this thread for this propose.

> The way I understand it, the ultimate goal is to allow extensions to
> define formats using CREATE XYZ.

Right.

>                   But the proposed patch does not do that, right? It
> only does some basic things at the C level, there's no DDL etc.

Right. The latest patch set includes only the basic things
for the first implementation.

> Per the commit message, none of the existing formats (text/csv/binary)
> is implemented as "copy routine".

Right.

>                                   IMHO that's a bit strange, because
> that's exactly what I'd expect this patch to do - to define all the
> infrastructure (catalogs, ...) and switch the existing formats to it.

We did it in the v1-v15 patch sets. But the v16/v17 patch
sets remove it because of a profiling result. (It's
described later.)

In general, we don't want to decrease the current
performance of the existing formats:


https://www.postgresql.org/message-id/flat/10025bac-158c-ffe7-fbec-32b42629121f%40dunslane.net#81cf82c219f2f2d77a616bbf5e511a5c

> We've spent quite a lot of blood sweat and tears over the years to make
> COPY fast, and we should not sacrifice any of that lightly.

The v15 patch set is faster than HEAD but there is a
mysterious profiling result:

https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889

> The increase in CopyOneRowTo from 80% to 85% worries me
...
>                                                  I am getting faster
> runtimes with v15 (6232ms in average) vs HEAD (6550ms).

I think that it's not a blocker because the v15 patch set
approach is faster. But someone may think that it's a
blocker. So the v16 or later patch sets don't include codes
to use this extension mechanism for the existing formats. We
can work on it after we introduce the basic features if it's
valuable.

> How would you even know the new code is correct, when there's nothing
> using using the "copy routine" branch?

We can't test it only with the v16/v17 patch set
changes. But we can do it by adding more changes we did in
the v6 patch set.

https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c

If we should commit the basic changes with tests, I can
adjust the test mechanism in v6 patch set and add it to the
latest patch set. But it needs CREATE XYZ mechanism and
so on too. Is it OK?

> In fact, doesn't this mean that the benchmarks presented earlier are not
> very useful? We still use the old code, except there are a couple "if"
> branches that are never taken? I don't think this measures the new
> approach would not be slower once everything gets to be copy routine.

Here is a benchmark result with the v17 and HEAD:

https://www.postgresql.org/message-id/flat/ZelfYatRdVZN3FbE%40paquier.xyz#eccfd1a0131af93c48026d691cc247f4

It shows that no performance difference for the existing
formats.

The added mechanism may be slower than the existing formats
mechanism but it's not a blocker. Because it's never
performance regression. (Because this is a new feature.)

We can improve it later if it's needed.

> Also, how do we know this API is suitable for the alternative formats?

The v6 patch set has more APIs built on this API. These APIs
are for implementing the alternative formats.


https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c

This is an Apache Arrow format implementation based on the
v6 patch set: https://github.com/kou/pg-copy-arrow

> For example you mentioned Arrow, and I suppose people will want to add
> support for other column-oriented formats. I assume that will require
> stashing a batch of rows (or some other internal state) somewhere, but
> does the proposed API plan for that?
>
> My guess would be we'll need to add a "private_data" pointer to the
> CopyFromStateData/CopyToStateData structs, but maybe I'm wrong.

I think so too. The v6 patch set has a "private_data"
pointer. But the v17 patch set doesn't have it because the
v17 patch set has only basic changes. We'll add it and other
features in the following patches:

https://www.postgresql.org/message-id/flat/20240305.171808.667980402249336456.kou%40clear-code.com

> I'll send the following patches after this patch is
> merged. They are based on the v6 patch[1]:
> 
> 1. Add copy_handler
>    * This also adds a pg_proc lookup for custom FORMAT
>    * This also adds a test for copy_handler
> 2. Export CopyToStateData
>    * We need it to implement custom copy TO handler
> 3. Add needed APIs to implement custom copy TO handler
>    * Add CopyToStateData::opaque
>    * Export CopySendEndOfRow()
> 4. Export CopyFromStateData
>    * We need it to implement custom copy FROM handler
> 5. Add needed APIs to implement custom copy FROM handler
>    * Add CopyFromStateData::opaque
>    * Export CopyReadBinaryData()

"Copy{To,From}StateDate::opaque" are the "private_data"
pointer in the v6 patch.

> Also, won't the alternative formats require custom parameters. For
> example, for column-oriented-formats it might be useful to specify a
> stripe size (rows per batch), etc. I'm not saying this patch needs to
> implement that, but maybe the API should expect it?

Yes. The v6 patch set also has the API. But we want to
minimize API set as much as possible in the first
implementation.

https://www.postgresql.org/message-id/flat/Zbi1TwPfAvUpKqTd%40paquier.xyz#00abc60c5a1ad9eee395849b7b5a5e0d

>                           I am really worried about the complexities
> this thread is getting into because we are trying to shape the
> callbacks in the most generic way possible based on *two* use cases.
> This is going to be a never-ending discussion.  I'd rather get some
> simple basics, and then we can discuss if tweaking the callbacks is
> really necessary or not.

And I agree with this approach.

> 1) Switch the existing formats to the new API, to validate the API works
> at least for them, allow testing and benchmarking the code.

I want to keep the current style for the first
implementation to avoid affecting the existing formats
performance. If it's not allowed to move forward this
proposal, could someone help us to solve the mysterious
result (why are %s of CopyOneRowTo() different?) in the
following v15 patch set benchmark result?

https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889

> 2) Try implementing some of the more exotic formats (column-oriented) to
> test the API works for those too.
>
> 3) Maybe try implementing a PoC version to do the DDL, so that it
> actually is extensible.
> 
> It's not my intent to "move the goalposts" - I think it's fine if the
> patches (2) and (3) are just PoC, to validate (1) goes in the right
> direction. For example, it's fine if (2) just hard-codes the new format
> next to the build-in ones - that's not something we'd commit, I think,
> but for validation of (1) it's good enough.
>
> Most of the DDL stuff can probably be "copied" from FDW handlers. It's
> pretty similar, and the "routine" idea is what FDW does too. It probably
> also shows a good way to "initialize" the routine, etc.

Is the v6 patch set enough for it?

https://www.postgresql.org/message-id/flat/20240124.144936.67229716500876806.kou%40clear-code.com#f1ad092fc5e81fe38d3c376559efd52c

Or should we do it based on the v17 patch set? If so, I'll
work on it now. It was a plan that I'll do after the v17
patch set is merged.


Thanks,
-- 
kou
pgsql-hackers by date:
From: Jeevan Chalke
Date: 22 July 2024, 07:12:21
Subject: Re: jsonpath: Inconsistency of timestamp_tz() Output
From: jian he
Date: 22 July 2024, 08:01:45
Subject: Re: Virtual generated columns
Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

Previous

Next