On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote:
>On 30/10/2020 22:56, Tomas Vondra wrote:
>>I agree this design looks simpler. I'm a bit worried about serializing
>>the parsing like this, though. It's true the current approach (where the
>>first phase of parsing happens in the leader) has a similar issue, but I
>>think it would be easier to improve that in that design.
>>
>>My plan was to parallelize the parsing roughly like this:
>>
>>1) split the input buffer into smaller chunks
>>
>>2) let workers scan the buffers and record positions of interesting
>>characters (delimiters, quotes, ...) and pass it back to the leader
>>
>>3) use the information to actually parse the input data (we only need to
>>look at the interesting characters, skipping large parts of data)
>>
>>4) pass the parsed chunks to workers, just like in the current patch
>>
>>
>>But maybe something like that would be possible even with the approach
>>you propose - we could have a special parse phase for processing each
>>buffer, where any worker could look for the special characters, record
>>the positions in a bitmap next to the buffer. So the whole sequence of
>>states would look something like this:
>>
>> EMPTY
>> FILLED
>> PARSED
>> READY
>> PROCESSING
>
>I think it's even simpler than that. You don't need to communicate the
>"interesting positions" between processes, if the same worker takes
>care of the chunk through all states from FILLED to DONE.
>
>You can build the bitmap of interesting positions immediately in
>FILLED state, independently of all previous blocks. Once you've built
>the bitmap, you need to wait for the information on where the first
>line starts, but presumably finding the interesting positions is the
>expensive part.
>
I don't think it's that simple. For example, the previous block may
contain a very long value (say, 1MB), in which case a bunch of
consecutive blocks have to be processed by the same worker. That
probably makes the state transitions a bit more complicated, and it
also means the bitmap would need to be passed to the worker that
actually processes the block. Or we might just ignore this, on the
grounds that it's not a very common situation.
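To make that a bit more concrete, here's a rough sketch of what I mean by
chunk states and ownership (just an illustration, none of these names are
from the patch):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical chunk states, following the sequence discussed above,
 * plus a final DONE state. All names are made up for illustration.
 */
typedef enum ChunkState
{
    CHUNK_EMPTY,        /* buffer not filled yet */
    CHUNK_FILLED,       /* raw data read from the input */
    CHUNK_PARSED,       /* bitmap of interesting positions built */
    CHUNK_READY,        /* line boundaries known, ready to process */
    CHUNK_PROCESSING,   /* a worker is inserting the tuples */
    CHUNK_DONE
} ChunkState;

typedef struct ChunkDesc
{
    ChunkState  state;
    int         owner;      /* worker owning the chunk, -1 if none */
    size_t      first_line; /* offset of the first line start */
} ChunkDesc;

/*
 * Advance a chunk to the next state. The owner is recorded in the
 * descriptor, so the same worker carries the chunk from FILLED all
 * the way to DONE - which is what a value spanning multiple chunks
 * would force us to extend across the following chunks too.
 */
static ChunkState
chunk_advance(ChunkDesc *chunk, int worker)
{
    /* only the owning worker (or a fresh chunk) may advance */
    assert(chunk->owner == -1 || chunk->owner == worker);

    chunk->owner = worker;
    chunk->state = (ChunkState) (chunk->state + 1);

    if (chunk->state == CHUNK_DONE)
        chunk->owner = -1;      /* release the chunk */

    return chunk->state;
}
```

So the extra complexity is mostly in deciding when a worker has to claim
the next chunk too, not in the state machine itself.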
>>Of course, the question is whether parsing really is sufficiently
>>expensive for this to be worth it.
>
>Yeah, I don't think it's worth it. Splitting the lines is pretty fast,
>I think we have many years to come before that becomes a bottleneck.
>But if it turns out I'm wrong and we need to implement that, the path
>is pretty straightforward.
>
OK. I agree the parsing is relatively cheap, and I don't recall seeing
CSV parsing as a bottleneck in production. I suspect that might simply
be because we're hitting other bottlenecks first, though.
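FWIW the scan itself is just a tight loop over the buffer looking for
delimiters, quotes and newlines, which is presumably why it's so cheap.
Something like this (simplified; a real version would honor the actual
COPY options and set bits in a bitmap next to the buffer, rather than
collecting offsets):

```c
#include <string.h>

/*
 * Record positions of "interesting" characters (delimiter, quote,
 * newline) in a chunk. Returns the number of positions found.
 * Hard-coded CSV defaults, purely for illustration.
 */
static int
scan_chunk(const char *buf, int len, int *positions, int max)
{
    int         n = 0;

    for (int i = 0; i < len && n < max; i++)
    {
        char    c = buf[i];

        if (c == ',' || c == '"' || c == '\n')
            positions[n++] = i;
    }

    return n;
}
```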
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services