Re: Parallel copy - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Parallel copy
Date
Msg-id CANwKhkMGSari24F3TFMT=b_9SLt-+K4uGnxvc=FsMPnh=7FW6g@mail.gmail.com
In response to Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel copy
List pgsql-hackers
On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > > >
> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > This is something similar to what I had also in mind for this idea.  I
> > > > > had thought of handing over complete chunk (64K or whatever we
> > > > > decide).  The one thing that slightly bothers me is that we will add
> > > > > some additional overhead of copying to and from shared memory which
> > > > > was earlier from local process memory.  And, the tokenization (finding
> > > > > line boundaries) would be serial.  I think that tokenization should be
> > > > > a small part of the overall work we do during the copy operation, but
> > > > > will do some measurements to ascertain the same.
> > > >
> > > > I don't think any extra copying is needed.
> > > >
> > >
> > > I am talking about access to shared memory instead of the process
> > > local memory.  I understand that an extra copy won't be required.
> > >
> > > > The reader can directly
> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> > > >
> > >
> > > I am slightly confused here.  AFAIU, the for(;;) loop in
> > > CopyReadLineText is about finding the line endings which we thought
> > > that the reader process will do.
> >
> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
> > currently copies data from cstate->raw_buf to the StringInfo in
> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
> > to local line_buf until it hits the line end found by the data reader. The
> > amount of copying done is still exactly the same as it is now.
> >
>
> Yeah, on a broader level it will be something like that, but actual
> details might vary during implementation.  BTW, have you given any
> thoughts on one other approach I have shared above [1]?  We might not
> go with that idea, but it is better to discuss different ideas and
> evaluate their pros and cons.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com
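
To make the shared-buffer handoff discussed above a bit more concrete, here is
a minimal sketch of how a reader could record line boundaries for workers to
claim. All names and the layout are hypothetical, not taken from any actual
patch; it only illustrates that the worker-side copy is the same single copy
CopyReadLineText already does from raw_buf to line_buf today:

/*
 * Hypothetical sketch only: the reader fills a shared chunk and records line
 * endings; workers claim lines and copy them into their local line buffer.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define CHUNK_SIZE  (64 * 1024)     /* bytes the reader hands over per chunk */
#define MAX_LINES   4096            /* line endings the reader records */

typedef struct SharedChunk
{
    char        data[CHUNK_SIZE];       /* raw input, filled by the reader */
    size_t      used;                   /* number of valid bytes in data */
    int         nlines;                 /* line endings found by the reader */
    size_t      line_end[MAX_LINES];    /* offset just past each newline */
    atomic_int  next_line;              /* next line index a worker may claim */
} SharedChunk;

/*
 * Worker side: claim the next unprocessed line and copy it into the worker's
 * local buffer.  Returns the line length, or -1 when the chunk is exhausted
 * or the local buffer is too small.
 */
static ssize_t
claim_next_line(SharedChunk *chunk, char *local_buf, size_t local_size)
{
    int     lineno = atomic_fetch_add(&chunk->next_line, 1);
    size_t  start;
    size_t  len;

    if (lineno >= chunk->nlines)
        return -1;
    start = (lineno == 0) ? 0 : chunk->line_end[lineno - 1];
    len = chunk->line_end[lineno] - start;
    if (len > local_size)
        return -1;
    memcpy(local_buf, chunk->data + start, len);
    return (ssize_t) len;
}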

It seems to me that, at least for the general CSV case, the tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
to coordinate between multiple workers is only going to make it slower. It may
be possible to enforce limitations on the input (e.g. no quotes allowed) or do
some speculative tokenization (e.g. if we encounter a quote before a newline,
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more fully featured
approach of handling it in a single reader process looks to be fast enough, I
don't see the point. I rather think that the next big step would be to overlap
reading the input and tokenization, hopefully by utilizing Andres's work on
async I/O.
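
For illustration, here is a minimal sketch of what such speculative quote-state
detection could look like. The names are hypothetical and nothing here reflects
existing COPY code; the guess could be wrong (e.g. a quoted field later on a
line that started unquoted), so it would still have to be verified once the
preceding chunk has been tokenized:

/*
 * Hypothetical illustration of the speculative heuristic mentioned above:
 * if a quote is seen before any newline in a chunk, guess that the chunk
 * starts inside a quoted field.
 */
#include <stddef.h>

typedef enum QuoteGuess
{
    GUESS_OUTSIDE_QUOTES,   /* chunk appears to start between lines/fields */
    GUESS_INSIDE_QUOTES     /* chunk appears to start inside a quoted field */
} QuoteGuess;

static QuoteGuess
guess_initial_quote_state(const char *chunk, size_t len, char quote)
{
    for (size_t i = 0; i < len; i++)
    {
        if (chunk[i] == '\n')
            return GUESS_OUTSIDE_QUOTES;    /* newline seen first */
        if (chunk[i] == quote)
            return GUESS_INSIDE_QUOTES;     /* quote seen first */
    }
    return GUESS_OUTSIDE_QUOTES;            /* neither seen in this chunk */
}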

Regards,
Ants Aasma


