Re: Parallel copy - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Parallel copy
Date
Msg-id CANwKhkN8jEeKREkM+g0RqPHwT=AkH+Qb3LpEAkb=wPKHMZfS8A@mail.gmail.com
Whole thread Raw
In response to Re: Parallel copy  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> - If we're unable to supply data to the COPY process as fast as the
> workers could load it, then speed will be limited at that point. We
> know reading the file from disk is pretty fast compared to what a
> single process can do. I'm not sure we've tested what happens with a
> network socket. It will depend on the network speed some, but it might
> be useful to know how many MB/s we can pump through over a UNIX
> socket.

This raises a good point. If at some point we want to minimize the
number of memory copies, we might want to allow RDMA to write
incoming network traffic directly into a distributing ring buffer,
which would include the protocol-level headers. But at this point we
are so far from network reception becoming a bottleneck that I don't
think it's worth holding anything up just to leave room for zero-copy
transfers.

> - The portion of the time that is used to split the lines is not
> easily parallelizable. That seems to be a fairly small percentage for
> a reasonably wide table, but it looks significant (13-18%) for a
> narrow table. Such cases will gain less performance and be limited to
> a smaller number of workers. I think we also need to be careful about
> files whose lines are longer than the size of the buffer. If we're not
> careful, we could get a significant performance drop-off in such
> cases. We should make sure to pick an algorithm that seems like it
> will handle such cases without serious regressions and check that a
> file composed entirely of such long lines is handled reasonably
> efficiently.

I don't have a proof, but my gut feel tells me that it's fundamentally
impossible to ingest csv without a serial line-ending/comment
tokenization pass. The current line splitting algorithm is terrible.
I'm currently working with some scientific data where on ingestion
CopyReadLineText() accounts for about 25% of the profile. I prototyped
a replacement that can do ~8GB/s on narrow rows, more on wider ones.
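To make the serial pass concrete, here is a minimal sketch of quote-aware
line splitting (this is my own illustration, not the prototype mentioned
above, and `split_csv_lines` is a hypothetical name): one serial scan that
records record-terminating newlines while skipping newlines embedded in
quoted fields. A real implementation would use memchr()/SIMD to skip
uninteresting bytes instead of a byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find CSV record boundaries in one serial pass.  Newlines inside
 * double-quoted fields do not end a record.  Returns the number of
 * record-terminating newlines found, storing their offsets in ends[].
 * Note the CSV escape "" toggles in_quotes twice, a net no-op, so no
 * special case is needed for it.
 */
static size_t
split_csv_lines(const char *buf, size_t len, size_t *ends, size_t max_ends)
{
    size_t  n = 0;
    int     in_quotes = 0;

    for (size_t i = 0; i < len && n < max_ends; i++)
    {
        char    c = buf[i];

        if (c == '"')
            in_quotes = !in_quotes;
        else if (c == '\n' && !in_quotes)
            ends[n++] = i;
    }
    return n;
}
```

The key property for parallelism is that only this pass is serial; everything
after a record boundary is known can be handed to a worker.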

For rows that are consistently wider than the input buffer I think
parallelism will still give a win: the serial phase is just a memcpy
through a ring buffer, after which a worker goes away to perform the
actual insert, letting the next worker read the data. The memcpy is
already happening today - CopyReadLineText() copies the input buffer
into a StringInfo - so the only extra work is synchronization between
leader and worker.
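As a single-process illustration of that handoff (names and sizes are my
invention; in PostgreSQL the two sides would be separate processes
synchronizing over shared memory, and the ring would be far larger than 8
bytes): the leader memcpys a row wider than the ring through it in pieces,
and the worker drains and reassembles it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 8             /* deliberately smaller than a row */

typedef struct
{
    char    data[RING_SIZE];
    size_t  head;               /* total bytes written */
    size_t  tail;               /* total bytes read */
} Ring;

/* Copy up to len bytes into the ring; returns bytes actually accepted. */
static size_t
ring_write(Ring *r, const char *src, size_t len)
{
    size_t  done = 0;

    while (done < len && r->head - r->tail < RING_SIZE)
        r->data[r->head++ % RING_SIZE] = src[done++];
    return done;
}

/* Copy up to len bytes out of the ring; returns bytes actually read. */
static size_t
ring_read(Ring *r, char *dst, size_t len)
{
    size_t  done = 0;

    while (done < len && r->head > r->tail)
        dst[done++] = r->data[r->tail++ % RING_SIZE];
    return done;
}

/* Drive one whole row through the ring, alternating producer/consumer. */
static void
pump(Ring *r, const char *row, size_t len, char *out)
{
    size_t  written = 0,
            nread = 0;

    while (nread < len)
    {
        written += ring_write(r, row + written, len - written);
        nread += ring_read(r, out + nread, len - nread);
    }
}
```

The point is that the serial side never has to see a whole row at once, so
rows longer than the buffer don't force a slow path.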

> - There could be index contention. Let's suppose that we can read data
> super fast and break it up into lines super fast. Maybe the file we're
> reading is fully RAM-cached and the lines are long. Now all of the
> backends are inserting into the indexes at the same time, and they
> might be trying to insert into the same pages. If so, lock contention
> could become a factor that hinders performance.

Different data distribution strategies can have an effect on that.
Dealing out input data in larger or smaller chunks will have a
considerable effect on contention, btree page splits and all kinds of
things. I think the common theme would be a push to increase chunk
size to reduce contention.
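For illustration (a hypothetical dealer, not anything in the patch): a
round-robin assignment in chunks means larger chunks give each worker longer
runs of consecutive rows, so at any moment concurrent inserts touch fewer
shared heap/index pages.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Deal rows to workers round-robin in chunks of chunk_size consecutive
 * rows.  With chunk_size = 1 every worker interleaves on neighbouring
 * rows (maximum contention); with a large chunk_size each worker owns a
 * long contiguous run of the input.
 */
static int
worker_for_row(size_t row, size_t chunk_size, int nworkers)
{
    return (int) ((row / chunk_size) % nworkers);
}
```

Tuning would then be a trade-off between contention (favouring large chunks)
and load balance at end of input (favouring small ones).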

> - There could also be similar contention on the heap. Say the tuples
> are narrow, and many backends are trying to insert tuples into the
> same heap page at the same time. This would lead to many lock/unlock
> cycles. This could be avoided if the backends avoid targeting the same
> heap pages, but I'm not sure there's any reason to expect that they
> would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think the parallel copy patch needs to concentrate on splitting input
data among workers. After that, any performance issues would be
basically the same as for a normal parallel insert workload. There may
well be bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec
