Re: Parallel copy - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Parallel copy
Date
Msg-id 60fb6859-60f4-4a5b-1866-73c3cd2f0d16@iki.fi
Whole thread Raw
In response to Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
> 
> Okay, but this first-phase (finding the line boundaries) can anyway
> be not done in parallel and we have seen in some of the initial 
> benchmarking that this initial phase is a small part of work 
> especially when the table has indexes, constraints, etc. So, I think 
> it won't matter much if this splitting is done in a single process
> or multiple processes.
Right, it won't matter performance-wise. That's not my point. The 
difference is in the complexity. If you don't store the line boundaries 
in shared memory, you get away with much simpler shared memory structures.

> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and author) favoured the current approach (single reader and multiple
> consumers) we decided to go with that. I feel this patch is very much
> in the POC stage due to which the code doesn't look good and as we
> move forward we need to see what is the better way to improve it,
> maybe one of the ways is to split it as you are suggesting so that it
> can be easier to review.

Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this 
nice to review and work with.

2. Refactor CopyReadLine(), so that finding the line-endings and the 
rest of the line-parsing are separated into separate functions.

3. Implement parallel copy.

> I think the other important thing which this
> patch has not addressed properly is the parallel-safety checks as
> pointed by me earlier. There are two things to solve there (a) the
> lower-level code (like heap_* APIs, CommandCounterIncrement, xact.c
> APIs, etc.) have checks which doesn't allow any writes, we need to see
> which of those we can open now (or do some additional work to prevent
> from those checks) after some of the work done for parallel-writes in
> PG-13[1][2], and (b) in which all cases we can parallel-writes
> (parallel copy) is allowed, for example need to identify whether table
> or one of its partitions has any constraint/expression which is
> parallel-unsafe.

Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] 
https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: ModifyTable overheads in generic plans
Next
From: Dilip Kumar
Date:
Subject: Re: Re: parallel distinct union and aggregate support patch