
From: Ants Aasma
Subject: Re: Parallel copy
Msg-id: CANwKhkOhucXnFomwFS+Sas5=69k21J3JbVJuL-BPXAb4RbsREQ@mail.gmail.com
In response to: Re: Parallel copy (Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
List: pgsql-hackers
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds up the number of quotes in the first k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.
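
For concreteness, the serial pass of the quoted scheme could look
something like the sketch below (hypothetical names, not code from any
patch; it also ignores the escape problem discussed next):

typedef struct ChunkStats
{
    long    nquotes;        /* (a) number of quotes in the chunk */
    long    nl_after_even;  /* (b) first newline after an even number of quotes, or -1 */
    long    nl_after_odd;   /* (c) first newline after an odd number of quotes, or -1 */
} ChunkStats;

/*
 * Resolve each chunk's first record delimiter by prefix-summing the
 * per-chunk quote counts; a chunk's adjusted end is simply the next
 * chunk's resolved start.
 */
static void
resolve_chunk_starts(const ChunkStats *stats, long *first_delim, int nchunks)
{
    long    quotes_before = 0;

    for (int k = 0; k < nchunks; k++)
    {
        if (quotes_before % 2 == 0)
            first_delim[k] = stats[k].nl_after_even;    /* starts outside a quote */
        else
            first_delim[k] = stats[k].nl_after_odd;     /* starts inside a quote */
        quotes_before += stats[k].nquotes;
    }
}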

The trouble is that, at least with the current coding, the number of
quotes in a chunk can depend on whether the chunk started inside a
quote or not. That's because escape characters only count inside
quotes. See for example the following CSV:

foo,\"bar
baz",\"xyz"

This currently parses as one line, and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization results would still work. The
quote counting and newline finding just have to be done for both
cases: the chunk starting inside a quote and starting outside one.
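
A minimal sketch of such a two-state scan, assuming backslash-style
escapes as in the example above (names are hypothetical, nothing here
is taken from a patch):

#include <stdbool.h>

typedef struct ChunkStats
{
    long    nquotes;    /* quotes seen under this starting assumption */
    long    first_nl;   /* first newline outside quotes, or -1 */
} ChunkStats;

/*
 * Scan one chunk with two state machines in lockstep: one assuming the
 * chunk starts outside a quoted field, the other assuming it starts
 * inside one. Because the escape character is only honoured inside
 * quotes, the two machines can disagree even on the quote count.
 */
static void
scan_chunk(const char *buf, long len, char quote, char escape,
           ChunkStats *start_outside, ChunkStats *start_inside)
{
    ChunkStats  st[2] = {{0, -1}, {0, -1}};
    bool        in_quote[2] = {false, true};
    bool        skip[2] = {false, false};

    for (long i = 0; i < len; i++)
    {
        for (int s = 0; s < 2; s++)
        {
            char    c = buf[i];

            if (skip[s])
                skip[s] = false;            /* previous char escaped this one */
            else if (in_quote[s] && c == escape && escape != quote)
                skip[s] = true;             /* escapes only count inside quotes */
            else if (c == quote)
            {
                st[s].nquotes++;
                in_quote[s] = !in_quote[s];
            }
            else if (c == '\n' && !in_quote[s] && st[s].first_nl < 0)
                st[s].first_nl = i;
        }
    }
    *start_outside = st[0];
    *start_inside = st[1];
}

The serial pass then walks the chunks in order, picks whichever
variant matches the state the previous chunk ended in, and gets the
next chunk's starting state from that variant's quote-count parity.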

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass, and processing of a
chunk can proceed immediately after its boundaries are known. This
could all be done by keeping the data in a single ring buffer with a
processing pipeline: one process does the reading, workers grab
tokenization chunks as they become available, one process determines
the chunk boundaries, and after that the chunks are processed.
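
As a sketch, the per-chunk bookkeeping for such a pipeline might look
like this (hypothetical layout, not from any patch; ChunkStats is the
struct from the earlier sketch, repeated here for completeness):

#include <stdatomic.h>

typedef struct ChunkStats { long nquotes; long first_nl; } ChunkStats;

typedef enum ChunkState
{
    CHUNK_EMPTY,        /* ring slot is free for the reader */
    CHUNK_FILLED,       /* reader has copied raw bytes into the ring */
    CHUNK_SCANNED,      /* a worker has produced stats for both variants */
    CHUNK_BOUNDED,      /* boundary resolver has fixed start and end */
    CHUNK_DONE          /* a worker has parsed and inserted the rows */
} ChunkState;

typedef struct ChunkSlot
{
    _Atomic ChunkState state;
    long        off;            /* position of raw bytes in the ring buffer */
    long        len;
    ChunkStats  variant[2];     /* starting outside / inside a quote */
    long        start;          /* resolved first record delimiter */
    long        end;            /* resolved end = next chunk's start */
} ChunkSlot;

/*
 * Roles running concurrently over the same ring of slots:
 *   reader:   EMPTY -> FILLED, sequential read of the input
 *   workers:  FILLED -> SCANNED, any worker, any order
 *   resolver: SCANNED -> BOUNDED, strictly in file order, carrying the
 *             in-quote state across chunk boundaries
 *   workers:  BOUNDED -> DONE, parse rows and insert, any worker
 */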

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma


