Re: Parallel copy - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Parallel copy |
Date | |
Msg-id | CAA4eK1+k6z7aF8+t_+uRWsUqQg=kMyPrYH9gT5Y5EXW8eVV4Cw@mail.gmail.com |
In response to | Re: Parallel copy (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: Parallel copy |
List | pgsql-hackers |
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This work is to parallelize the copy command and in particular "Copy
> > <table_name> from 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.
>

Thanks.

> > The first idea is that we allocate each chunk to a worker and once the
> > worker has finished processing the current chunk, it can start with the
> > next unprocessed chunk.  Here, we need to see how to handle the partial
> > tuples at the end or beginning of each chunk.  We can read the chunks
> > into dsa/dsm instead of into a local buffer for processing.
> > Alternatively, if we think that accessing shared memory can be costly,
> > we can read the entire chunk into local memory but copy the partial
> > tuple at the beginning of a chunk (if any) to a dsa.  We mainly need the
> > partial tuple in the shared memory area.  The worker which has found the
> > initial part of the partial tuple will be responsible for processing the
> > entire tuple.  Now, to detect whether there is a partial tuple at the
> > beginning of a chunk, we always start reading one byte prior to the
> > start of the current chunk, and if that byte is not a terminating line
> > byte, we know that it is a partial tuple.  Then, while processing the
> > chunk, we will ignore this first line and start after the first
> > terminating line.
>
> That's quite similar to the approach I took with a parallel file_fdw
> patch[1], which mostly consisted of parallelising the reading part of
> copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need to have
> > another data structure (for ease of reference in this email, I call it
> > CTM (chunk-tuple-map)) in shared memory where we store some per-chunk
> > information like the chunk number, the dsa location of that chunk, and a
> > variable which indicates whether we can free/reuse the current entry.
> > Whenever we encounter a partial tuple at the beginning of a chunk, we
> > note down its chunk number and dsa location in the CTM.  Next, whenever
> > we encounter a partial tuple at the end of a chunk, we search the CTM
> > for the next chunk number and read from the corresponding dsa location
> > till we encounter the terminating line byte.  Once we have read and
> > processed this partial tuple, we can mark the entry as available for
> > reuse.  There are some loose ends here, like how many entries we should
> > allocate in this data structure.  It depends on whether we want to allow
> > a worker to start reading the next chunk before the partial tuple of the
> > previous chunk is processed.  To keep it simple, we can allow a worker
> > to process the next chunk only when the partial tuple in the previous
> > chunk is processed.  This will allow us to keep the number of entries in
> > the CTM equal to the number of workers.  I think we can easily improve
> > this if we want, but I don't think it will matter too much, as in most
> > cases, by the time we have processed the tuples in a chunk, the partial
> > tuple would have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism, I just had
> each worker follow the final tuple in its chunk into the next chunk, and
> have each worker ignore the first tuple in each chunk after chunk 0
> because it knows someone else is looking after that.  That means that
> there was some double reading going on near the boundaries,

Right, and especially if the part in the second chunk is bigger, we
might need to read most of the second chunk.

> and considering how much I've been complaining about bogus extra system
> calls on this mailing list lately, yeah, your idea of doing a bit more
> coordination is a better idea.  If you go this way, you might at least
> find the copy.c part of the patch I wrote useful as stand-in scaffolding
> code in the meantime while you prototype the parallel writing side, if
> you don't already have something better for this?
>

No, I haven't started writing anything yet, but I have some ideas on how
to achieve this.  I quickly skimmed through your patch, and I think it
can be used as a starting point, though if we decide to go with
accumulating the partial tuple or all the data in shm, then things might
differ.
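To make the bookkeeping a bit more concrete, here is a rough standalone
sketch of the one-byte look-behind check and a fixed-size CTM with one
entry per worker.  It is untested, every name in it is a placeholder
invented for this illustration rather than taken from any patch, and in
a real implementation the entries and chunks would live in a DSM/DSA
segment instead of plain arrays:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_WORKERS 8            /* one in-flight partial tuple per worker */

/* One CTM entry: which chunk owns a partial tuple and where that chunk lives. */
typedef struct ChunkTupleMapEntry
{
    int     chunk_no;            /* chunk whose beginning holds the partial tuple */
    size_t  chunk_location;      /* stand-in for the dsa_pointer of the chunk */
    bool    reusable;            /* set once the partial tuple has been consumed */
} ChunkTupleMapEntry;

static ChunkTupleMapEntry ctm[MAX_WORKERS];

/*
 * A chunk starts with a partial tuple if the byte just before its start is
 * not the line terminator; the very first chunk has no preceding byte, so it
 * can never start with one.
 */
static bool
chunk_has_partial_start(const char *data, size_t chunk_start)
{
    if (chunk_start == 0)
        return false;
    return data[chunk_start - 1] != '\n';
}

int
main(void)
{
    /* two complete lines; the third chunk begins in the middle of "e,f" */
    const char *data = "a,b\nc,d\ne,f\n";
    size_t      chunk_starts[] = {0, 4, 9};

    for (int i = 0; i < 3; i++)
    {
        if (chunk_has_partial_start(data, chunk_starts[i]))
        {
            /* record it so the previous chunk's worker can finish the tuple */
            ctm[i % MAX_WORKERS] = (ChunkTupleMapEntry) {i, chunk_starts[i], false};
            printf("chunk %d starts with a partial tuple\n", i);
        }
        else
            printf("chunk %d starts on a tuple boundary\n", i);
    }
    return 0;
}

The reusable flag plays the role of the free/reuse marker described
above: the worker that finishes the dangling tuple from the previous
chunk would set it so the slot can be recycled.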
> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from the
> > file; it copies the complete tuples of one chunk into shared memory
> > and, once that is done, hands over that chunk to another worker which
> > can process the tuples in that area.  We can imagine that the reader
> > worker is responsible for forming some sort of work queue that can be
> > processed by the other workers.  In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers, and we might need some additional memory processing,
> > as after the reader worker has handed over the initial shared memory
> > segment, we need to somehow identify tuple boundaries and then process
> > them.
>
> Yeah, I have also wondered about something like this in a slightly
> different context.  For parallel query in general, I wondered if there
> should be a Parallel Scatter node, that can be put on top of any
> parallel-safe plan, and it runs it in a worker process that just pushes
> tuples into a single-producer multi-consumer shm queue, and then other
> workers read from that whenever they need a tuple.
>

The idea sounds great, but past experience shows that shoving all the
tuples through a queue might add a significant overhead.  However, I
don't know how exactly you are planning to use it?

> Hmm, but for COPY, I suppose you'd want to push the raw lines with
> minimal examination, not tuples, into a shm queue, so I guess that's a
> bit different.
>

Yeah.

> > Another thing we need to figure out is how many workers to use for the
> > copy command.  I think we can decide it based on the file size, which
> > needs some experiments, or maybe based on user input.
>
> It seems like we don't even really have a general model for that sort of
> thing in the rest of the system yet, and I guess some kind of fairly
> dumb explicit system would make sense in the early days...
>

Makes sense; a trivial file-size-based heuristic is sketched below.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
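As a first cut at the worker-count question, one could simply scale the
number of workers with the file size and let an explicit user request
override it.  A minimal sketch, assuming a made-up 64MB-per-worker
threshold and placeholder names (nothing here comes from an actual
patch):

#include <stdio.h>

#define COPY_BYTES_PER_WORKER (64L * 1024 * 1024)   /* arbitrary 64MB per worker */

/*
 * Pick a worker count for a parallel COPY: honour an explicit user request
 * if given, otherwise scale with file size, and never exceed the cap.
 */
static int
choose_copy_workers(long long file_size, int requested, int max_workers)
{
    int nworkers;

    if (requested > 0)
        nworkers = requested;
    else
        nworkers = (int) (file_size / COPY_BYTES_PER_WORKER) + 1;

    return (nworkers > max_workers) ? max_workers : nworkers;
}

int
main(void)
{
    /* a 1GB file with no explicit request, capped at 8 workers */
    printf("%d\n", choose_copy_workers(1LL << 30, 0, 8));
    return 0;
}

Whether 64MB (or any fixed threshold) is sensible is exactly the kind of
thing that would need the experiments mentioned above.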