Re: Parallel copy - Mailing list pgsql-hackers
From | Andres Freund
Subject | Re: Parallel copy
Date |
Msg-id | 20200415181913.4gjqcnuzxfzbbzxa@alap3.anarazel.de
In response to | Re: Parallel copy (Ants Aasma <ants@cybertec.at>)
Responses | Re: Parallel copy
List | pgsql-hackers
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.

> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think not having to hold a lock *while* splitting (as we'd need in some proposals floated earlier) is already likely to be beneficial. I don't think we need to worry about the scalability of the lock protecting the queue of already-split data, for now.

I don't think we really want to have a much larger chunk size, btw. It makes it more likely that the data handed out to workers takes an uneven amount of time to process.

> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of evenly sized blocks, and have the data structure of split boundaries (I still think a ringbuffer makes sense) point into those entries. If we don't require the input blocks to be in-order in that array, we can reuse blocks therein that are fully processed, even if "earlier" data in the input has not yet been fully processed.

> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate split boundaries. And that workers would just pop entries from it before they actually process the data (stored outside of the ringbuffer). Since the split boundaries will always be read in order by workers, and the entries will be tiny, there's no need to avoid copying out entries.

So basically what I was thinking we *eventually* may want (I'd forgo some of this initially) is something like:

struct InputBlock
{
    uint32 unprocessed_chunk_parts;
    uint32 following_block;
    char data[INPUT_BLOCK_SIZE];
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
    uint32 firstblock;
    uint32 startoff;
};

struct ChunkedInputBoundaries
{
    uint32 read_pos;
    uint32 write_end;
    ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with unprocessed_chunk_parts == 0. Then it'd split the read input data into chunks (presumably with chunk size << input block size), putting identified chunks into ChunkedInputBoundaries. For each ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of each InputBlock containing parts of the chunk. For chunks spanning more than one InputBlock, each InputBlock's following_block would be set accordingly.

Workers would just pop an entry from the ringbuffer (making that entry reusable), and process the chunk. The underlying data would not be copied out of the InputBlocks, but obviously readers would need to take care to handle InputBlock boundaries. Whenever a chunk is fully read, or when crossing an InputBlock boundary, the InputBlock's unprocessed_chunk_parts would be decremented.

Recycling of InputBlocks could probably just be an occasional linear search for buffers with unprocessed_chunk_parts == 0.

Something roughly like this should not be too complicated to implement.
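To make that bookkeeping a bit more concrete, here is a minimal, self-contained C sketch of the operations described above. It is only illustrative: the constants, the helper names (find_free_block(), publish_chunk(), pop_chunk(), release_block_part()), and the use of plain uint32_t instead of PostgreSQL's uint32 are assumptions of mine, and all synchronization (around read_pos/write_end and the reference counts) as well as shared-memory placement is omitted.

#include <stdint.h>

#define INPUT_BLOCK_SIZE  (64 * 1024)
#define NINPUT_BLOCKS     64            /* > 2 * nworkers */
#define RINGSIZE          1024
#define INVALID_BLOCK     UINT32_MAX

typedef struct InputBlock
{
    uint32_t    unprocessed_chunk_parts;    /* chunks still referencing this block */
    uint32_t    following_block;            /* next block, if a chunk spills over */
    char        data[INPUT_BLOCK_SIZE];
} InputBlock;

typedef struct ChunkedInputBoundary
{
    uint32_t    firstblock;     /* index into input_blocks[] */
    uint32_t    startoff;       /* chunk start offset within that block */
} ChunkedInputBoundary;

typedef struct ChunkedInputBoundaries
{
    uint32_t    read_pos;       /* next entry workers will pop */
    uint32_t    write_end;      /* next entry the leader will fill */
    ChunkedInputBoundary ring[RINGSIZE];
} ChunkedInputBoundaries;

static InputBlock input_blocks[NINPUT_BLOCKS];
static ChunkedInputBoundaries boundaries;

/* Leader: occasional linear search for a recyclable block. */
static uint32_t
find_free_block(void)
{
    for (uint32_t i = 0; i < NINPUT_BLOCKS; i++)
    {
        if (input_blocks[i].unprocessed_chunk_parts == 0)
            return i;
    }
    return INVALID_BLOCK;       /* caller waits for workers to catch up */
}

/* Leader: record one identified chunk and pin each block it touches. */
static void
publish_chunk(uint32_t firstblock, uint32_t startoff, uint32_t nblocks)
{
    ChunkedInputBoundary *entry = &boundaries.ring[boundaries.write_end % RINGSIZE];

    entry->firstblock = firstblock;
    entry->startoff = startoff;

    for (uint32_t i = 0, blk = firstblock; i < nblocks; i++)
    {
        input_blocks[blk].unprocessed_chunk_parts++;
        blk = input_blocks[blk].following_block;
    }
    boundaries.write_end++;     /* a full ring is not handled in this sketch */
}

/* Worker: pop one boundary; copying it out is cheap since entries are tiny. */
static int
pop_chunk(ChunkedInputBoundary *out)
{
    if (boundaries.read_pos == boundaries.write_end)
        return 0;               /* nothing queued */
    *out = boundaries.ring[boundaries.read_pos % RINGSIZE];
    boundaries.read_pos++;
    return 1;
}

/* Worker: after finishing its part of a chunk within a block, unpin it. */
static void
release_block_part(uint32_t blk)
{
    input_blocks[blk].unprocessed_chunk_parts--;
}

In a real implementation input_blocks and the boundary ring would of course live in shared memory, and read_pos, write_end and unprocessed_chunk_parts would need atomic or lock-protected updates.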
Unless extremely unlucky (very wide input data spanning many InputBlocks) a straggling reader would not prevent global progress, it'd just prevent reuse of the InputBlocks containing data for its chunk (normally that'd be two InputBlocks, not more).

> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index data, writing out dirty buffers for reclaim, row locks for uniqueness checks, extension locks, ...), as well as non-uniform per-row costs (partial indexes, index splits, ...), I think we ought to try to cope well with that. IMO/IME it'll be common to see stalls that are much longer than 10ms for processes that do COPY, even when the system is not overloaded.

> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader / workers though, and some of the cache-to-cache transfers will be more expensive even within a socket (between core complexes for AMD, multi-chip processors for Intel).

Greetings,

Andres Freund