Re: Parallel copy - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: Parallel copy
Date
Msg-id CANwKhkPmM18UYpOt_AEB4JC6fa0dfA1PfgiQyNzeNUxEpG=XUw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is something similar to what I had also in mind for this idea.  I
> had thought of handing over complete chunk (64K or whatever we
> decide).  The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory.  And, the tokenization (finding
> line boundaries) would be serial.  I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly
fread()/pq_copymsgbytes() into shared memory, and the workers can run
CopyReadLineText() inner loop directly off of the buffer in shared memory.

For serial performance of tokenization into lines, I really think a SIMD
based approach will be fast enough for quite some time. I hacked up the code in
the simdcsv  project to only tokenize on line endings and it was able to
tokenize a CSV file with short lines at 8+ GB/s. There are going to be many
other bottlenecks before this one starts limiting. Patch attached if you'd
like to try that out.

Regards,
Ants Aasma

Attachment

pgsql-hackers by date:

Previous
From: Juan José Santamaría Flecha
Date:
Subject: Re: Clean up some old cruft related to Windows
Next
From: Fujii Masao
Date:
Subject: Re: pg_stat_progress_basebackup - progress reporting forpg_basebackup, in the server side