Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Date
Msg-id CAPpHfdvV8FC67Emeb9XJpULkMOtrJiyC0dGL7FMSyRZ2SLk=5Q@mail.gmail.com
In response to [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Alexey Kondratov <kondratov.aleksey@gmail.com>)
Responses Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Alex K <kondratov.aleksey@gmail.com>)
List pgsql-hackers
Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
Thank you for your responses and valuable comments!


It seems that COPY is currently able to return the first error line and the error type (extra or missing columns, type parse error, etc.).
Thus, an approach similar to the one Stas described should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

I have the following questions about your proposal.

1. Suppose we have to insert N records
2. We create a subtransaction with these N records
3. An error is raised on the k-th line
4. Then we can safely insert all lines from the 1st through the (k - 1)-th
5. Report, save to an errors table, or silently drop the k-th line
6. Next, try to insert lines from (k + 1) through N in another subtransaction
7. Repeat until the end of the file

Do you assume that we start a new subtransaction in step 4, since the subtransaction we started in step 2 is rolled back?
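
To make the question concrete, here is a minimal sketch of the seven-step scheme in Python. It is only a simulation under stated assumptions, not PostgreSQL code: a "subtransaction" is modelled as a list that is discarded on error, and the names copy_skipping_errors and parse_row are hypothetical. It assumes that a rolled-back subtransaction loses lines 1..(k - 1) too, so they have to be inserted again in a fresh subtransaction before continuing from line k + 1.

```python
def parse_row(line):
    """Hypothetical row parser: expects 'integer,text'; raises ValueError otherwise."""
    fields = line.split(",")
    if len(fields) != 2:
        raise ValueError(f"expected 2 columns, got {len(fields)}")
    return int(fields[0]), fields[1]


def copy_skipping_errors(lines, parse):
    """Simulate COPY that skips bad lines using one subtransaction per attempt.

    Returns (committed_rows, errors), where errors is a list of
    (1-based line number, message) pairs.
    """
    committed = []
    errors = []
    start = 0
    n = len(lines)
    while start < n:
        pending = []          # rows inserted in the current simulated subtransaction
        failed_at = None
        for i in range(start, n):
            try:
                pending.append(parse(lines[i]))
            except ValueError as exc:
                failed_at = i
                errors.append((i + 1, str(exc)))
                break
        if failed_at is None:
            committed.extend(pending)   # no error: the subtransaction commits
            break
        # Assumption: the failed subtransaction is rolled back entirely, so the
        # good lines before the error are replayed in a new subtransaction.
        pending = []
        replay = [parse(lines[j]) for j in range(start, failed_at)]
        committed.extend(replay)
        start = failed_at + 1           # continue after the bad line
    return committed, errors
```

With this model every error costs a second pass over the lines that preceded it in the same batch, which is why the scheme is cheap only when error rows are rare.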

I am planning to use background worker processes for parallel COPY execution. Each process will receive an equal piece of the input file. Since the file is split by size, not by lines, each worker will start importing from the first new line so that it does not hit a broken line.
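
The size-based split with alignment to line starts could look roughly like the sketch below. It is an illustration only: the function name split_on_line_boundaries is hypothetical, the input is held in memory as bytes, and it assumes that every newline ends a record, which does not hold for CSV data with newlines inside quoted fields.

```python
def split_on_line_boundaries(data: bytes, workers: int):
    """Split data into `workers` byte ranges of roughly equal size.

    Each cut point is moved forward to the start of the next line, so no
    worker begins in the middle of a line. Returns a list of (start, end)
    offsets; a range may be empty if lines are long relative to the chunk size.
    """
    size = len(data)
    aligned = [0]
    for i in range(1, workers):
        cut = size * i // workers               # raw cut point by size
        if cut == 0 or data[cut - 1:cut] == b"\n":
            pos = cut                            # already at a line start
        else:
            nl = data.find(b"\n", cut)
            pos = size if nl == -1 else nl + 1   # first byte after the next newline
        aligned.append(max(pos, aligned[-1]))    # keep boundaries non-decreasing
    aligned.append(size)
    return [(aligned[i], aligned[i + 1]) for i in range(workers)]
```

Each worker would then process data[start:end] for its own range. The ranges cover the input exactly once, so no line is imported twice or skipped.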

I think the situation where the backend reads the file directly during COPY is not typical.  The more typical case is the psql \copy command.  In that case "COPY ... FROM stdin;" is actually executed while psql streams the data.
How can we apply parallel COPY in this case?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
