
From: Andres Freund
Subject: Re: Parallel copy
Date:
Msg-id: 78C0107E-62F2-4F76-BFD8-34C73B716944@anarazel.de
In response to: Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Parallel copy (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>>
>> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> > - The portion of the time that is used to split the lines is not
>> > easily parallelizable. That seems to be a fairly small percentage for
>> > a reasonably wide table, but it looks significant (13-18%) for a
>> > narrow table. Such cases will gain less performance and be limited to
>> > a smaller number of workers. I think we also need to be careful about
>> > files whose lines are longer than the size of the buffer. If we're not
>> > careful, we could get a significant performance drop-off in such
>> > cases. We should make sure to pick an algorithm that seems like it
>> > will handle such cases without serious regressions and check that a
>> > file composed entirely of such long lines is handled reasonably
>> > efficiently.
>>
>> I don't have a proof, but my gut feel tells me that it's fundamentally
>> impossible to ingest csv without a serial line-ending/comment
>> tokenization pass.

I can't quite see a way either. But even if it were possible, I have a hard time seeing parallelizing that pass as the right thing.
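
To illustrate why that pass is inherently serial (a minimal sketch only, assuming a simplified CSV dialect where quotes are escaped by doubling and a bare \n ends a record - this is not the actual COPY code): whether a given newline terminates a record depends on the current quote state, which in turn depends on every byte before it.

/*
 * Sketch only, not the actual COPY code: record boundaries can only be
 * found by tracking quote state from the start of the input, because a
 * newline inside a quoted field is data, not a terminator.
 */
#include <stdbool.h>
#include <stddef.h>

typedef void (*record_cb) (size_t end_offset, void *arg);

static void
scan_csv_records(const char *buf, size_t len, char quote,
                 record_cb cb, void *arg)
{
    bool        in_quotes = false;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (c == quote)
            in_quotes = !in_quotes; /* a doubled quote toggles twice and cancels out */
        else if (c == '\n' && !in_quotes)
            cb(i + 1, arg);         /* boundary only when outside quotes */
    }
}

A worker starting at an arbitrary offset cannot know in_quotes without having scanned everything before that offset, which is what makes the tokenization pass so hard to parallelize.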


>I think even if we try to do it via multiple workers it might not be
>better.  In such a scheme, every worker needs to update the end
>boundaries and the next worker has to keep checking whether the previous
>one has updated the end pointer.  I think this can add significant
>synchronization effort for cases where tuples are 100 or so bytes,
>which will be a common case.

It seems like it'd also have terrible caching and instruction level parallelism behavior. By constantly switching the process that analyzes boundaries, the current data will have to be brought into L1/registers, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kB. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive, and cause a lot more contention.
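
As a rough illustration of what I mean (hypothetical names, not from any posted patch): once the serial pass has found the record boundaries, the leader can hand out work units of a few hundred kB that always end on a boundary, so each worker gets whole tuples and a large enough batch to amortize the handoff.

/*
 * Hypothetical sketch only: carve the input into ~256 kB chunks aligned
 * to record boundaries produced by the serial tokenization pass.
 */
#include <stddef.h>

#define CHUNK_TARGET (256 * 1024)   /* "a few hundred kB" */

typedef struct InputChunk
{
    size_t      start;              /* offset of the chunk's first record */
    size_t      end;                /* one past its last record's newline */
} InputChunk;

/* record_ends[] holds the record end offsets from the serial pass. */
static int
build_chunks(const size_t *record_ends, int nrecords,
             InputChunk *chunks, int max_chunks)
{
    int         nchunks = 0;
    size_t      start = 0;
    int         i;

    for (i = 0; i < nrecords && nchunks < max_chunks; i++)
    {
        /* close the chunk once it reaches the target size, or at the end */
        if (record_ends[i] - start >= CHUNK_TARGET || i == nrecords - 1)
        {
            chunks[nchunks].start = start;
            chunks[nchunks].end = record_ends[i];
            nchunks++;
            start = record_ends[i];
        }
    }
    return nchunks;
}

Each worker then parses and inserts its chunk independently, and any natural clustering in the input stays mostly within one worker's batch instead of being spread across all of them.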


>> The current line splitting algorithm is terrible.
>> I'm currently working with some scientific data where on ingestion
>> CopyReadLineText() is about 25% on profiles. I prototyped a
>> replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire COPY parsing code. It's terrible.
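
For what it's worth, one direction the line-splitting part of such a rewrite could take (a sketch under my own assumptions; the ~8GB/s prototype mentioned above isn't shown in this thread) is to let memchr(), which most libcs vectorize, locate newlines instead of inspecting every byte in a per-character loop:

/*
 * Sketch only, not the prototype discussed above: find line ends with
 * memchr() and let the caller deal with a trailing partial line.
 */
#include <stddef.h>
#include <string.h>

static size_t
split_lines(const char *buf, size_t len,
            size_t *line_ends, size_t max_lines)
{
    size_t      nlines = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end && nlines < max_lines)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            break;              /* incomplete line; caller refills the buffer */

        line_ends[nlines++] = (size_t) (nl - buf) + 1;
        p = nl + 1;
    }
    return nlines;
}

Quoting, escapes and multi-byte encodings would still need handling on top of this, so treat it purely as an illustration of the scanning approach, not a drop-in replacement.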

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


