
From: Andres Freund
Subject: Re: Parallel copy
Date:
Msg-id: 78C0107E-62F2-4F76-BFD8-34C73B716944@anarazel.de
In response to: Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Responses: Re: Parallel copy (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers

Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>>
>> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> > - The portion of the time that is used to split the lines is not
>> > easily parallelizable. That seems to be a fairly small percentage for
>> > a reasonably wide table, but it looks significant (13-18%) for a
>> > narrow table. Such cases will gain less performance and be limited to
>> > a smaller number of workers. I think we also need to be careful about
>> > files whose lines are longer than the size of the buffer. If we're not
>> > careful, we could get a significant performance drop-off in such
>> > cases. We should make sure to pick an algorithm that seems like it
>> > will handle such cases without serious regressions and check that a
>> > file composed entirely of such long lines is handled reasonably
>> > efficiently.
>>
>> I don't have a proof, but my gut feel tells me that it's fundamentally
>> impossible to ingest csv without a serial line-ending/comment
>> tokenization pass.

I can't quite see a way either. But even if it were possible, I have a hard time seeing parallelizing that pass as the right thing.
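
To illustrate why that pass is inherently serial (a minimal sketch only, assuming a simplified CSV dialect where quotes are escaped by doubling and a bare \n ends a record - this is not the actual COPY code): whether a given newline terminates a record depends on the current quote state, which in turn depends on every byte before it.

/*
 * Sketch only, not the actual COPY code: record boundaries can only be
 * found by tracking quote state from the start of the input, because a
 * newline inside a quoted field is data, not a terminator.
 */
#include <stdbool.h>
#include <stddef.h>

typedef void (*record_cb) (size_t end_offset, void *arg);

static void
scan_csv_records(const char *buf, size_t len, char quote,
                 record_cb cb, void *arg)
{
    bool        in_quotes = false;
    size_t      i;

    for (i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (c == quote)
            in_quotes = !in_quotes; /* a doubled quote toggles twice and cancels out */
        else if (c == '\n' && !in_quotes)
            cb(i + 1, arg);         /* boundary only when outside quotes */
    }
}

A worker starting at an arbitrary offset cannot know in_quotes without having scanned everything before that offset, which is what makes the tokenization pass so hard to parallelize.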


>I think even if we try to do it via multiple workers it might not be
>better.  In such a scheme, every worker needs to update the end
>boundaries and the next worker has to keep checking whether the previous
>one has updated the end pointer.  I think this can add significant
>synchronization effort for cases where tuples are 100 or so bytes,
>which will be a common case.

It seems like it'd also have terrible caching and instruction level parallelism behavior. By constantly switching the process that analyzes boundaries, the current data will have to be brought into L1/registers, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kB. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive, and cause a lot more contention.
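
As a rough illustration of what I mean (hypothetical names, not from any posted patch): once the serial pass has found the record boundaries, the leader can hand out work units of a few hundred kB that always end on a boundary, so each worker gets whole tuples and a large enough batch to amortize the handoff.

/*
 * Hypothetical sketch only: carve the input into ~256 kB chunks aligned
 * to record boundaries produced by the serial tokenization pass.
 */
#include <stddef.h>

#define CHUNK_TARGET (256 * 1024)   /* "a few hundred kB" */

typedef struct InputChunk
{
    size_t      start;              /* offset of the chunk's first record */
    size_t      end;                /* one past its last record's newline */
} InputChunk;

/* record_ends[] holds the record end offsets from the serial pass. */
static int
build_chunks(const size_t *record_ends, int nrecords,
             InputChunk *chunks, int max_chunks)
{
    int         nchunks = 0;
    size_t      start = 0;
    int         i;

    for (i = 0; i < nrecords && nchunks < max_chunks; i++)
    {
        /* close the chunk once it reaches the target size, or at the end */
        if (record_ends[i] - start >= CHUNK_TARGET || i == nrecords - 1)
        {
            chunks[nchunks].start = start;
            chunks[nchunks].end = record_ends[i];
            nchunks++;
            start = record_ends[i];
        }
    }
    return nchunks;
}

Each worker then parses and inserts its chunk independently, and any natural clustering in the input stays mostly within one worker's batch instead of being spread across all of them.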


>> The current line splitting algorithm is terrible.
>> I'm currently working with some scientific data where on ingestion
>> CopyReadLineText() is about 25% on profiles. I prototyped a
>> replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire COPY parsing code. It's terrible.
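
For what it's worth, one direction the line-splitting part of such a rewrite could take (a sketch under my own assumptions; the ~8GB/s prototype mentioned above isn't shown in this thread) is to let memchr(), which most libcs vectorize, locate newlines instead of inspecting every byte in a per-character loop:

/*
 * Sketch only, not the prototype discussed above: find line ends with
 * memchr() and let the caller deal with a trailing partial line.
 */
#include <stddef.h>
#include <string.h>

static size_t
split_lines(const char *buf, size_t len,
            size_t *line_ends, size_t max_lines)
{
    size_t      nlines = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end && nlines < max_lines)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            break;              /* incomplete line; caller refills the buffer */

        line_ends[nlines++] = (size_t) (nl - buf) + 1;
        p = nl + 1;
    }
    return nlines;
}

Quoting, escapes and multi-byte encodings would still need handling on top of this, so treat it purely as an illustration of the scanning approach, not a drop-in replacement.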

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


