Re: Parallel copy - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel copy
Date
Msg-id CAA4eK1+k6z7aF8+t_+uRWsUqQg=kMyPrYH9gT5Y5EXW8eVV4Cw@mail.gmail.com
In response to Re: Parallel copy  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Parallel copy
List pgsql-hackers
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.
>

Thanks.

> > The first idea is that we allocate each chunk to a worker and once the
> > worker has finished processing the current chunk, it can start with
> > the next unprocessed chunk.  Here, we need to see how to handle the
> > partial tuples at the end or beginning of each chunk.  We can read the
> > chunks into dsa/dsm instead of into a local buffer for processing.
> > Alternatively, if we think that accessing shared memory can be costly,
> > we can read the entire chunk into local memory, but copy the partial
> > tuple at the beginning of a chunk (if any) to a dsa.  We mainly need
> > the partial tuple in the shared memory area.  The worker which has
> > found the initial part of the partial tuple will be responsible for
> > processing the entire tuple.  Now, to detect whether there is a
> > partial tuple at the beginning of a chunk, we always start reading one
> > byte prior to the start of the current chunk, and if that byte is not
> > a terminating line byte, we know that it is a partial tuple.  Now,
> > while processing the chunk, we will ignore this first line and start
> > after the first terminating line.
>
> That's quite similar to the approach I took with a parallel file_fdw
> patch[1], which mostly consisted of parallelising the reading part of
> copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need to
> > have another data structure (for ease of reference in this email, I
> > call it CTM (chunk-tuple-map)) in shared memory where we store some
> > per-chunk information like the chunk-number, the dsa location of that
> > chunk, and a variable which indicates whether we can free/reuse the
> > current entry.  Whenever we encounter a partial tuple at the beginning
> > of a chunk, we note down its chunk number and dsa location in the CTM.
> > Next, whenever we encounter any partial tuple at the end of a chunk,
> > we search the CTM for the next chunk-number and read from the
> > corresponding dsa location till we encounter the terminating line
> > byte.  Once we have read and processed this partial tuple, we can mark
> > the entry as available for reuse.  There are some loose ends here,
> > such as how many entries we should allocate in this data structure.
> > It depends on whether we want to allow a worker to start reading the
> > next chunk before the partial tuple of the previous chunk is
> > processed.  To keep it simple, we can allow the worker to process the
> > next chunk only when the partial tuple in the previous chunk has been
> > processed.  This will allow us to keep the number of entries in the
> > CTM equal to the number of workers.  I think we can easily improve
> > this if we want, but I don't think it will matter too much, as in most
> > cases, by the time we have processed the tuples in that chunk, the
> > partial tuple would have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism, I just
> had each worker follow the final tuple in its chunk into the next
> chunk, and had each worker ignore the first tuple in every chunk after
> chunk 0 because it knows someone else is looking after it.  That
> means that there was some double reading going on near the boundaries,
>

Right, and especially if the part of the tuple in the second chunk is
bigger, we might need to read most of the second chunk.
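
To make that boundary rule concrete, here is a minimal, standalone
sketch of how a worker could decide where its own data starts within a
chunk.  The names are made up, and it ignores newlines embedded inside
quoted CSV fields, which the real code would have to deal with:

#include <stddef.h>

/*
 * Return the offset of the first tuple this worker owns inside
 * [chunk_start, chunk_end).  If the byte just before chunk_start is not
 * a newline, the chunk begins with the tail of a tuple that belongs to
 * the previous chunk's worker, so we skip past the first newline.
 */
static size_t
chunk_first_tuple_offset(const char *buf, size_t chunk_start,
                         size_t chunk_end)
{
    size_t      pos = chunk_start;

    /* chunk 0, or the previous byte terminates a line: we own it all */
    if (chunk_start == 0 || buf[chunk_start - 1] == '\n')
        return chunk_start;

    /* otherwise skip the partial first line; its owner will handle it */
    while (pos < chunk_end && buf[pos] != '\n')
        pos++;

    return (pos < chunk_end) ? pos + 1 : chunk_end;
}

The worker that owns the previous chunk keeps reading past its own
chunk_end until it reaches the newline that terminates the spanned
tuple, which is where the CTM lookup comes in.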

> and considering how much I've been complaining about bogus extra
> system calls on this mailing list lately, yeah, your idea of doing a
> bit more coordination is a better idea.  If you go this way, you might
> at least find the copy.c part of the patch I wrote useful as stand-in
> scaffolding code in the meantime while you prototype the parallel
> writing side, if you don't already have something better for this?
>

No, I haven't started writing anything yet, but I have some ideas on
how to achieve this.  I quickly skimmed through your patch and I think
it can be used as a starting point, though if we decide to go with
accumulating the partial tuple (or all the data) in shm, things might
differ.
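
For reference, here is a rough sketch of what one CTM entry could look
like, based only on the fields mentioned above (chunk number, dsa
location, and a free/reuse flag).  The types are simplified; in a real
patch this would live in a DSM segment and the location would be a
dsa_pointer:

#include <stdbool.h>
#include <stdint.h>

typedef struct ChunkTupleMapEntry
{
    uint64_t    chunk_number;   /* which chunk this entry describes */
    uint64_t    chunk_location; /* where the chunk's data lives in dsa */
    bool        reusable;       /* set once the partial tuple spanning
                                 * into the next chunk has been fully
                                 * processed, so the entry can be reused */
} ChunkTupleMapEntry;

With one entry per worker, a worker that hits a partial tuple at the
end of its chunk looks up the entry registered for the next chunk and
keeps reading from that chunk's location until the terminating line
byte.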

> > Another approach that came up during an offlist discussion with
> > Robert is that we have one dedicated worker for reading the chunks
> > from the file; it copies the complete tuples of one chunk into shared
> > memory and, once that is done, hands over that chunk to another worker
> > which can process the tuples in that area.  We can imagine that the
> > reader worker is responsible for forming some sort of work queue that
> > can be processed by the other workers.  In this idea, we won't be able
> > to get the benefit of initial tokenization (forming tuple boundaries)
> > via parallel workers, and might need some additional memory
> > processing: after the reader worker has handed over the initial shared
> > memory segment, we need to somehow identify tuple boundaries and then
> > process them.
>
> Yeah, I have also wondered about something like this in a slightly
> different context.  For parallel query in general, I wondered if there
> should be a Parallel Scatter node that can be put on top of any
> parallel-safe plan; it runs the plan in a worker process that just
> pushes tuples into a single-producer, multi-consumer shm queue, and
> then other workers read from that whenever they need a tuple.
>

The idea sounds great, but past experience shows that shoving all the
tuples through a queue might add significant overhead.  However, how
exactly are you planning to use it?
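
To ground the work-queue idea from the quoted paragraph above, a purely
illustrative, standalone sketch of a single-producer, multiple-consumer
ring of chunk descriptors might look like this.  A real implementation
would live in a DSM segment and use shared memory queues or condition
variables rather than C11 atomics and busy-waiting:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64            /* must be a power of two */

typedef struct ChunkDesc
{
    uint64_t    offset;         /* where this chunk's tuples start */
    uint32_t    len;            /* bytes of complete tuples in the chunk */
} ChunkDesc;

typedef struct ChunkRing
{
    _Atomic uint64_t next_write;    /* advanced by the reader worker */
    _Atomic uint64_t next_read;     /* advanced by consuming workers */
    ChunkDesc   slots[RING_SIZE];
} ChunkRing;

/* Reader: publish a chunk once its complete tuples are in place. */
static void
ring_publish(ChunkRing *ring, ChunkDesc desc)
{
    uint64_t    pos = atomic_load(&ring->next_write);

    /* wait for a free slot (real code would sleep on a latch) */
    while (pos - atomic_load(&ring->next_read) >= RING_SIZE)
        ;
    ring->slots[pos % RING_SIZE] = desc;
    atomic_store(&ring->next_write, pos + 1);
}

/* Worker: claim the next unprocessed chunk, if any is available. */
static bool
ring_claim(ChunkRing *ring, ChunkDesc *out)
{
    uint64_t    pos = atomic_load(&ring->next_read);

    while (pos < atomic_load(&ring->next_write))
    {
        ChunkDesc   desc = ring->slots[pos % RING_SIZE];

        if (atomic_compare_exchange_weak(&ring->next_read, &pos, pos + 1))
        {
            *out = desc;
            return true;
        }
        /* pos was refreshed by the failed CAS; retry */
    }
    return false;               /* nothing published yet */
}

The overhead concern is exactly the per-item cost of this kind of
handoff, which is why pushing whole chunks (or raw lines) rather than
individual tuples seems more attractive.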

>  Hmm,
> but for COPY, I suppose you'd want to push the raw lines with minimal
> examination, not tuples, into a shm queue, so I guess that's a bit
> different.
>

Yeah.

> > Another thing we need to figure out is how many workers to use for
> > the copy command.  I think we can decide that based on the file size,
> > which needs some experiments, or maybe based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>

Makes sense.
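
As a strawman for the worker-count question, something as simple as the
following might be enough to start with; the 64MB-per-worker figure is
invented purely for illustration and would need the experiments
mentioned above:

#include <stdint.h>

/*
 * Toy heuristic: roughly one worker per 64MB of input, at least one,
 * capped by a user-settable maximum (e.g. a GUC or an option on COPY).
 */
static int
choose_copy_workers(uint64_t file_size_bytes, int max_workers)
{
    int         nworkers = (int) (file_size_bytes / (64 * 1024 * 1024)) + 1;

    return (nworkers > max_workers) ? max_workers : nworkers;
}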

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


