Re: Parallel copy - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Parallel copy
Date
Msg-id 20200415181913.4gjqcnuzxfzbbzxa@alap3.anarazel.de
In response to Re: Parallel copy  (Ants Aasma <ants@cybertec.at>)
Responses Re: Parallel copy
List pgsql-hackers
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.


> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think not needing to hold a lock *while* splitting (as we'd need in
some proposals floated earlier) is likely already a benefit. I don't
think we need to worry about lock scalability for protecting the queue
of already-split data, for now.

I don't think we really want to have a much larger chunk size,
btw. It makes it more likely for the data handed to different workers
to take an uneven amount of time to process.


> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of
evenly sized blocks, and have the datastructure of split boundaries (I
still think a ringbuffer makes sense) point into those entries. If
we don't require the input blocks to be in-order in that array, we can
reuse blocks therein that are fully processed, even if "earlier" data in
the input has not yet been fully processed.


> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate split
boundaries. And that workers would just pop entries from it before they
actually process the data (stored outside of the ringbuffer). Since the
split boundaries will always be read in order by workers, and the
entries will be tiny, there's no need to avoid copying out entries.


So basically what I was thinking we *eventually* may want (I'd forgo some
of this initially) is something like:

struct InputBlock
{
    uint32 unprocessed_chunk_parts; // chunks (or parts thereof) still referencing this block
    uint32 following_block;         // next block, if a chunk continues past this one
    char data[INPUT_BLOCK_SIZE];
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
    uint32 firstblock;  // InputBlock the chunk starts in
    uint32 startoff;    // offset of the chunk within that block
};

struct ChunkedInputBoundaries
{
    uint32 read_pos;    // next entry to be consumed by a worker
    uint32 write_end;   // end of the entries written by the leader
    ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with
unprocessed_chunk_parts == 0. Then it'd split the read input data into
chunks (presumably with chunk size << input block size), putting
identified chunks into ChunkedInputBoundaries. For each
ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of each
InputBlock containing parts of the chunk.  For chunks spanning more than
one InputBlock, each InputBlock's following_block would be set accordingly.
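
Very roughly, and ignoring synchronization, ring-full handling and
chunks spanning multiple blocks, the leader loop could look like the
following (find_free_block(), read_input() and find_next_chunk_end()
are placeholders, not proposed interfaces):

    uint32 blockno = find_free_block(input_blocks);   // unprocessed_chunk_parts == 0
    InputBlock *block = &input_blocks[blockno];
    int nread = read_input(block->data, INPUT_BLOCK_SIZE);

    uint32 off = 0;
    while (off < nread)
    {
        // e.g. scan for the next unquoted newline for CSV
        uint32 chunk_end = find_next_chunk_end(block->data, off, nread);
        ChunkedInputBoundary *b =
            &boundaries->ring[boundaries->write_end % RINGSIZE];

        b->firstblock = blockno;
        b->startoff = off;
        block->unprocessed_chunk_parts++;   // this chunk references this block
        boundaries->write_end++;            // publish the entry to workers

        off = chunk_end;
    }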

Workers would just pop an entry from the ringbuffer (making that entry
reusable), and process the chunk. The underlying data would not be
copied out of the InputBlocks, but obviously readers would need to take
care to handle InputBlock boundaries. Whenever a chunk is fully read, or
when crossing an InputBlock boundary, the InputBlock's
unprocessed_chunk_parts would be decremented.
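
The worker side would, again only as a rough sketch (advancing the ring
read position would need a lock or an atomic in practice, and
process_chunk_part() just stands in for the actual row processing):

    // copy the tiny boundary entry out, making the ring slot reusable
    ChunkedInputBoundary boundary =
        boundaries->ring[boundaries->read_pos % RINGSIZE];
    boundaries->read_pos++;

    uint32 blockno = boundary.firstblock;
    uint32 off = boundary.startoff;

    for (;;)
    {
        InputBlock *block = &input_blocks[blockno];

        // process rows until the chunk ends or this block's data runs out
        bool done = process_chunk_part(block->data, off, INPUT_BLOCK_SIZE);

        // would need to be an atomic decrement in a real implementation
        block->unprocessed_chunk_parts--;

        if (done)
            break;

        // chunk continues in the next block the leader linked up
        blockno = block->following_block;
        off = 0;
    }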

Recycling of InputBlocks could probably just be an occasional linear
search for buffers with unprocessed_chunk_parts == 0.
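
E.g., as a sketch (NINPUTBLOCKS and the "none free" return value are
made up here):

    static uint32
    find_free_block(InputBlock *blocks)
    {
        // occasional linear scan; cheap given the small number of blocks
        for (uint32 i = 0; i < NINPUTBLOCKS; i++)
        {
            if (blocks[i].unprocessed_chunk_parts == 0)
                return i;
        }
        return UINT32_MAX;   // none free yet - caller waits for a worker to finish
    }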


Something roughly like this should not be too complicated to
implement. Unless extremely unlucky (very wide input data spanning many
InputBlocks) a straggling reader would not prevent global progress, it'd
just prevent reuse of the InputBlocks with data for its chunk (normally
that'd be two InputBlocks, not more).


> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index
data, writing out dirty buffers for reclaim, row locks for uniqueness
checks, extension locks, ...), as well as non-uniform per-row costs
(partial indexes, index splits, ...), I think we ought to try to cope
well with that. IMO/IME it'll be common to see stalls that are much
longer than 10ms for processes that do COPY, even when the system is not
overloaded.


> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader and workers
though, and some of the cache-to-cache transfers will be more expensive
even within a socket (between core complexes for AMD, between chips in
multi-chip processors for Intel).

Greetings,

Andres Freund


