Re: Parallel copy - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Parallel copy
Date:
Msg-id: 20200219103845.7rwdqe43z327sp3z@development
In response to: Re: Parallel copy (Ants Aasma <ants@cybertec.at>)
Responses: Re: Parallel copy; Re: Parallel copy
List: pgsql-hackers
On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote:
>On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
>> >
>> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > >
>> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
>> > > >
>> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > > > > This is something similar to what I had also in mind for this idea. I
>> > > > > had thought of handing over complete chunk (64K or whatever we
>> > > > > decide). The one thing that slightly bothers me is that we will add
>> > > > > some additional overhead of copying to and from shared memory which
>> > > > > was earlier from local process memory. And, the tokenization (finding
>> > > > > line boundaries) would be serial. I think that tokenization should be
>> > > > > a small part of the overall work we do during the copy operation, but
>> > > > > will do some measurements to ascertain the same.
>> > > >
>> > > > I don't think any extra copying is needed.
>> > > >
>> > >
>> > > I am talking about access to shared memory instead of the process
>> > > local memory. I understand that an extra copy won't be required.
>> > >
>> > > > The reader can directly
>> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
>> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory.
>> > > >
>> > >
>> > > I am slightly confused here. AFAIU, the for(;;) loop in
>> > > CopyReadLineText is about finding the line endings which we thought
>> > > that the reader process will do.
>> >
>> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
>> > currently copies data from cstate->raw_buf to the StringInfo in
>> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer
>> > to local line_buf until it hits the line end found by the data reader. The
>> > amount of copying done is still exactly the same as it is now.
>> >
>>
>> Yeah, on a broader level it will be something like that, but actual
>> details might vary during implementation. BTW, have you given any
>> thoughts on one other approach I have shared above [1]? We might not
>> go with that idea, but it is better to discuss different ideas and
>> evaluate their pros and cons.
>>
>> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com
>
>It seems to be that at least for the general CSV case the tokenization to
>tuples is an inherently serial task. Adding thread synchronization to that path
>for coordinating between multiple workers is only going to make it slower. It
>may be possible to enforce limitations on the input (e.g. no quotes allowed) or
>do some speculative tokenization (e.g. if we encounter quote before newline
>assume the chunk started in a quoted section) to make it possible to do the
>tokenization in parallel. But given that the simpler and more featured approach
>of handling it in a single reader process looks to be fast enough, I don't see
>the point. I rather think that the next big step would be to overlap reading
>input and tokenization, hopefully by utilizing Andres's work on asyncio.
>

I generally agree with the impression that parsing CSV is tricky and
unlikely to benefit from parallelism in general. There may be cases with
restrictions making it easier (e.g. restrictions on the format), but that
might be a bit too complex to start with.
For example, I had an idea to parallelise the parsing by splitting it into
two phases:

1) indexing

Split the CSV file into equally-sized chunks and have each worker just scan
through its chunk, storing the positions of delimiters, quotes, newlines
etc. This is probably the most expensive part of the parsing (essentially
going char by char), and we'd speed it up linearly. (A rough sketch of what
I mean is at the end of this mail.)

2) merge

Combine the information from (1) in a single process and actually parse the
CSV data - we would not have to inspect each character, because we'd know
the positions of the interesting chars, so this should be fast. We might
have to recheck some stuff (e.g. escaping), but it should still be much
faster.

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough
quantification of the individual steps of CSV processing - for example, if
parsing takes only 10% of the time, it's pretty pointless to start by
parallelising this part and we should focus on the rest. If it's 50%, it
might be a different story. Has anyone done any measurements?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
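PS: To make the indexing phase a bit more concrete, here is a minimal C
sketch (hypothetical types and function names, not actual PostgreSQL COPY
code). Each worker only records where the interesting characters are in
its chunk; deciding whether a given quote opens or closes a quoted section
is left to the merge phase, which sees the chunks in order.

#include <stddef.h>
#include <stdlib.h>

typedef enum SpecialKind
{
    SPECIAL_DELIM,
    SPECIAL_QUOTE,
    SPECIAL_NEWLINE
} SpecialKind;

typedef struct SpecialPos
{
    size_t      offset;     /* byte offset within the whole input */
    SpecialKind kind;
} SpecialPos;

typedef struct ChunkIndex
{
    SpecialPos *items;
    size_t      nitems;
    size_t      capacity;
} ChunkIndex;

/* append one position to the chunk's index (error handling omitted) */
static void
chunk_index_add(ChunkIndex *idx, size_t offset, SpecialKind kind)
{
    if (idx->nitems == idx->capacity)
    {
        idx->capacity = idx->capacity ? idx->capacity * 2 : 1024;
        idx->items = realloc(idx->items, idx->capacity * sizeof(SpecialPos));
    }
    idx->items[idx->nitems].offset = offset;
    idx->items[idx->nitems].kind = kind;
    idx->nitems++;
}

/*
 * Scan one chunk (buf .. buf+len) that starts at byte "base" of the input.
 * The worker does not try to interpret the characters, it only records
 * where they are; quoting/escaping is resolved later by the merge phase.
 */
static void
index_chunk(const char *buf, size_t len, size_t base,
            char delim, char quote, ChunkIndex *idx)
{
    for (size_t i = 0; i < len; i++)
    {
        char    c = buf[i];

        if (c == delim)
            chunk_index_add(idx, base + i, SPECIAL_DELIM);
        else if (c == quote)
            chunk_index_add(idx, base + i, SPECIAL_QUOTE);
        else if (c == '\n')
            chunk_index_add(idx, base + i, SPECIAL_NEWLINE);
    }
}

In the real thing these position arrays would presumably live in shared
memory (one per chunk), so the merge step could walk them without any extra
copying - but whether the indexing really dominates the cost is exactly the
measurement question above.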