Re: Parallel copy - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Parallel copy
Date
Msg-id F4A53483-833D-4972-8114-980B465F1A57@anarazel.de
In response to Re: Parallel copy  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel copy  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On April 9, 2020 12:29:09 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de>
>wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>between processes on a single tuple basis. Probably not even below a
>few hundred kb. If there's any sort of natural clustering in the loaded
>data - extremely common, think timestamps - splitting on a granular
>basis will make indexing much more expensive. And have a lot more
>contention.
>
>That's a fair point. I think the solution ought to be that once any
>process starts finding line endings, it continues until it's grabbed
>at least a certain amount of data for itself. Then it stops and lets
>some other process grab a chunk of data.
>
>Or are you arguing that there should be only one process that's
>allowed to find line endings for the entire duration of the load?

I've not yet read the whole thread. So I'm probably restating ideas.

Imo, yes, there should be only one process doing the chunking. For ILP, cache efficiency, but also because the leader
is the only process with access to the network socket. It should load input data into one large buffer that's shared
across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker
processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker
chunks should be reduced in size.
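
To make the shape of that concrete, here is a rough single-process sketch in Python, not PostgreSQL code. The names (OffsetRing, split_lines, claim, max_batch) are made up for illustration. The rule for shrinking a worker's batch in proportion to how full the ring is is just one assumed policy, and the newline splitting ignores CSV quoting. A real version would live in shared memory with proper synchronization.

```python
from collections import deque


class OffsetRing:
    """Bounded queue of (start, end) offsets into the shared input buffer.

    Hypothetical stand-in for a shared-memory ringbuffer; no locking here.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = deque()

    def is_full(self):
        return len(self.slots) >= self.capacity

    def push(self, span):
        """Leader side: add one offset pair, or report that the ring is full."""
        if self.is_full():
            return False  # leader has to wait until workers drain the ring
        self.slots.append(span)
        return True

    def claim(self, max_batch):
        """Worker side: take a batch of consecutive offsets.

        Assumed policy: a full ring hands out max_batch entries; otherwise
        the batch shrinks in proportion to the fill level (at least one).
        """
        if not self.slots:
            return []
        if self.is_full():
            n = max_batch
        else:
            n = max(1, max_batch * len(self.slots) // self.capacity)
        n = min(n, len(self.slots))
        return [self.slots.popleft() for _ in range(n)]


def split_lines(buf, start=0):
    """Leader side: yield (start, end) offsets of complete lines in buf.

    Naive newline search only; quoted newlines in CSV are not handled.
    """
    pos = start
    while True:
        nl = buf.find(b"\n", pos)
        if nl == -1:
            return
        yield (pos, nl + 1)
        pos = nl + 1


# The leader fills the ring; workers then claim batches of offsets.
buf = b"1,a\n2,b\n3,c\n4,d\n5,e\n6,f\n7,g\n8,h\n"
ring = OffsetRing(capacity=8)
for span in split_lines(buf):
    ring.push(span)

full_batch = ring.claim(max_batch=4)   # ring was full: whole batch
small_batch = ring.claim(max_batch=4)  # ring half full: reduced batch
rows = [buf[s:e] for s, e in full_batch]
```

Workers only ever touch offsets, so the input bytes are written once by the leader and read in place. Consecutive lines also stay with the same worker, which keeps naturally clustered data together.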

Given that everything stalls if the leader doesn't accept further input data, as well as when there are no available
split chunks, it doesn't seem like a good idea to have the leader do other work.


I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN is
the only thing that practically matters.

Andres


--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


