On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
>
> Okay, but this first-phase (finding the line boundaries) can anyway
> be not done in parallel and we have seen in some of the initial
> benchmarking that this initial phase is a small part of work
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much if this splitting is done in a single process
> or multiple processes.
Right, it won't matter performance-wise. That's not my point. The
difference is in the complexity. If you don't store the line boundaries
in shared memory, you get away with much simpler shared memory structures.
> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review.
Sure. I think the roadmap here is:
1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nice to review and work with.
2. Refactor CopyReadLine(), so that finding the line-endings and the
rest of the line-parsing are separated into separate functions.
3. Implement parallel copy.
> I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the
> lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
> APIs, etc.) have checks which doesn't allow any writes, we need to see
> which of those we can open now (or do some additional work to prevent
> from those checks) after some of the work done for parallel-writes in
> PG-13[1][2], and (b) in which all cases we can parallel-writes
> (parallel copy) is allowed, for example need to identify whether table
> or one of its partitions has any constraint/expression which is
> parallel-unsafe.
Agreed, that needs to be solved. I haven't given it any thought myself.
- Heikki
[1]
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi