Re: Parallel copy - Mailing list pgsql-hackers

From Kuntal Ghosh
Subject Re: Parallel copy
Msg-id CAGz5QCKQUs4vH5=3PEx0rJyvy48dij9QvzNyw+qZsTnwsxTCuw@mail.gmail.com
In response to Re: Parallel copy  (Ants Aasma <ants@cybertec.at>)
Responses Re: Parallel copy  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first new line after even number of quotes
> > c) position of the first new line after odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader sums the quote counts of the first
> > k-1 chunks.
> > - If that sum is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even
> > number of quotes (statistic (b) above) is the first record
> > delimiter in this chunk.
> > - Otherwise, if the sum is odd, the first newline after an odd
> > number of quotes (statistic (c) above) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained based on the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with the current coding, the number of
> quotes in a chunk can depend on whether the chunk started inside a
> quote or not. That's because escape characters only count inside
> quotes. See for example the following CSV:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization results would still work.
> The quote counting and newline finding just has to be done for both
> the starting-in-quote and the not-starting-in-quote cases.
>
Yeah, right.
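
Something like this two-state scan, I mean. A rough C sketch (the
names ChunkStats/scan_chunk are mine, not from any patch, and it
assumes backslash-style escaping where escapec != quotec, as in your
example):

#include <stdbool.h>
#include <stdint.h>

typedef struct ChunkStats
{
    int64_t     nquotes;    /* quotes seen under this assumption */
    int64_t     first_nl;   /* first newline outside quotes, or -1 */
} ChunkStats;

/*
 * Scan one chunk under a fixed assumption about whether it starts
 * inside a quoted field.  Escapes are honoured only inside quotes,
 * which is exactly why the two assumptions can diverge.
 */
static void
scan_chunk(const char *buf, int64_t len, bool start_in_quote,
           char quotec, char escapec, ChunkStats *st)
{
    bool        in_quote = start_in_quote;

    st->nquotes = 0;
    st->first_nl = -1;
    for (int64_t i = 0; i < len; i++)
    {
        char        c = buf[i];

        if (in_quote && c == escapec && i + 1 < len)
        {
            i++;            /* skip the escaped character */
            continue;
        }
        if (c == quotec)
        {
            st->nquotes++;
            in_quote = !in_quote;
        }
        else if (c == '\n' && !in_quote && st->first_nl < 0)
            st->first_nl = i;
    }
}

Each worker would run this twice per chunk (start_in_quote =
false/true). In the serial pass the leader knows chunk 1 starts
outside a quote, picks that ChunkStats, and the parity of its quote
count tells it which of the two stat sets applies to chunk 2, and so
on down the line.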

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass and processing the
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.
>
I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. There probably
isn't any way to avoid a single serial pass over the data (phase 2 in
the above case) to tokenise the chunks. So yeah, if the reading and
tokenisation phases don't take much time, parallelising them would
just be overkill. As pointed out by Andres and you, a lock-free
circular buffer implementation sounds like the way to go. AFAIK, a
CAS-based FIFO circular queue suffers from two problems: 1. (as you
pointed out) slow workers may block producers; 2. since it doesn't
partition the queue among the workers, it gets poor locality and
cache-friendliness, which limits scalability on NUMA systems.
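
To make the shape of it concrete, here's a minimal
single-producer/single-consumer ring in C11 atomics (purely
illustrative and my own sketch - the real thing would live in shared
memory with multiple consumers, which is exactly where the CAS
contention problems above show up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 1024             /* must be a power of two */

typedef struct
{
    _Atomic size_t  head;           /* next slot to fill (producer) */
    _Atomic size_t  tail;           /* next slot to drain (consumer) */
    void           *slot[RING_SLOTS];   /* chunk descriptors */
} Ring;

static bool
ring_put(Ring *r, void *chunk)
{
    size_t      head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t      tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;               /* full: producer must wait */
    r->slot[head & (RING_SLOTS - 1)] = chunk;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static void *
ring_get(Ring *r)
{
    size_t      tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t      head = atomic_load_explicit(&r->head, memory_order_acquire);
    void       *chunk;

    if (tail == head)
        return NULL;                /* empty: consumer must wait */
    chunk = r->slot[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return chunk;
}

A partitioned variant (one such ring per worker) is one way to get
back the locality that a single shared CAS-based queue loses.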

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting to an
> unindexed unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.
>
Yeah. Parsing text is a great use case for data parallelism, which can
be achieved with SIMD instructions. Consider processing 8-bit ASCII
characters in a 512-bit SIMD word: 64 bytes can be examined with a
single instruction. A lot of code and complexity in CopyReadLineText
would surely go away. And further (I'm not sure on this point), if we
can use the schema of the table, perhaps JIT can generate machine code
to read fields efficiently based on their types.
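
For instance, with AVX-512BW two compares per 64-byte block give you
bitmasks of all quote and newline positions (a sketch with my own
names; compile with -mavx512bw):

#include <immintrin.h>
#include <stdint.h>

typedef struct
{
    uint64_t    quotes;     /* bit i set: byte i is the quote char */
    uint64_t    newlines;   /* bit i set: byte i is '\n' */
} BlockMasks;

static BlockMasks
scan_block64(const char *p, char quotec)
{
    __m512i     data = _mm512_loadu_si512((const void *) p);
    BlockMasks  m;

    m.quotes = _mm512_cmpeq_epi8_mask(data, _mm512_set1_epi8(quotec));
    m.newlines = _mm512_cmpeq_epi8_mask(data, _mm512_set1_epi8('\n'));
    return m;
}

The serial boundary pass could then walk the masks with popcount/ctz
bit tricks instead of the byte-at-a-time branching CopyReadLineText
does today.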


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


