Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling - Mailing list pgsql-hackers

From Alex K
Subject Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Date
Msg-id CADfU8WxKzLun7X0o_HZ75p07JVanvXpkym6YmjrGX1n9CzNz6w@mail.gmail.com
In response to Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Alexander Korotkov <a.korotkov@postgrespro.ru>)
Responses Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi Alexander!

I missed your reply, since the proposal submission deadline passed last Monday and I haven't been checking the hackers mailing list very frequently since then.

(1) It seems that starting a new subtransaction at step 4 is not necessary. We can simply gather all error lines in one pass and, at the end of input, start only one additional subtransaction with all safe lines at once: [1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is the number of the i-th error line.

However, I assume that the only viable use case is when the number of errors is relatively small compared to the total number of rows, because if the input is in a totally inconsistent format it seems useless to import it into the db at all. Thus, it is not 100% clear to me whether starting a new subtransaction at step 4 or not would make any real difference in performance.
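Just to make the idea concrete, here is a toy sketch in plain Python (not backend code; check_line() is a hypothetical stand-in for attempting a row and raises ValueError on a line COPY would reject). It only shows how the safe-line set [1, ..., k1 - 1, k1 + 1, ...] is built in a single pass before the one additional subtransaction:

def split_safe_and_error_lines(lines, check_line):
    # Pass over the input once, recording only which lines fail.
    error_lines = []
    for lineno, line in enumerate(lines, start=1):
        try:
            check_line(line)          # raises ValueError for a rejected line
        except ValueError:
            error_lines.append(lineno)

    # Every remaining line goes into the single additional subtransaction.
    bad = set(error_lines)
    safe_lines = [n for n in range(1, len(lines) + 1) if n not in bad]
    return safe_lines, error_lines

# Example: with 6 input lines where lines 3 and 5 are broken,
# this returns safe_lines = [1, 2, 4, 6] and error_lines = [3, 5].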

(2) Hmm, good question. As far as I know it is impossible to get the size of the stdin input in advance, so stdin cannot be distributed directly to the parallel workers. The first approach that comes to mind is to store the stdin input in some kind of buffer/queue and have the workers read from it in parallel. The question is how this would perform for a large file; I guess poorly, at least from the memory consumption perspective. Whether parallel execution would still be faster is the next question.
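A minimal sketch of what I mean, again in plain Python rather than backend code (process_line() is a placeholder for parsing and inserting a row): the leader streams stdin into a bounded queue, so the total input size is never needed and memory consumption stays bounded, at the cost of the leader blocking whenever the workers fall behind.

import sys
from multiprocessing import Process, Queue

SENTINEL = None

def process_line(line):
    pass  # placeholder: parse the line and insert the row

def worker(queue):
    while True:
        line = queue.get()
        if line is SENTINEL:
            break
        process_line(line)

def leader(nworkers=4, max_buffered_lines=10000):
    queue = Queue(maxsize=max_buffered_lines)  # bounds memory consumption
    workers = [Process(target=worker, args=(queue,)) for _ in range(nworkers)]
    for w in workers:
        w.start()
    for line in sys.stdin:       # stream stdin; total size is never needed
        queue.put(line)          # blocks when the buffer is full
    for _ in workers:
        queue.put(SENTINEL)      # one sentinel per worker to stop it
    for w in workers:
        w.join()

if __name__ == "__main__":
    leader()

Whether something along these lines actually beats a single-process COPY would of course depend on how much of the work is parsing versus the insertion itself.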


Alexey



On Thu, Apr 6, 2017 at 4:47 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
Thank you for your responses and valuable comments!


It seems that COPY currently is able to return the first error line and the error type (extra or missing columns, type parse error, etc). Thus, an approach similar to the one Stas wrote about should work and, being optimised for a small number of error rows, should not affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

I have the following questions about your proposal.

1. Suppose we have to insert N records
2. We create a subtransaction with these N records
3. An error is raised on the k-th line
4. Then, we can safely insert all lines from the 1st till the (k - 1)-th
5. Report, save to an errors table, or silently drop the k-th line
6. Next, try to insert lines from (k + 1) till N in another subtransaction
7. Repeat until the end of file

Do you assume that we start a new subtransaction at step 4 because the subtransaction we started at step 2 has been rolled back?

I am planning to use background worker processes for parallel COPY execution. Each process will receive an equal piece of the input file. Since the file is split by size, not by lines, each worker will start importing from the first newline in its piece, so as not to hit a broken line.

I think the situation where the backend directly reads a file during COPY is not typical. A more typical case is the \copy psql command. In that case "COPY ... FROM stdin;" is actually executed while psql streams the data.
How can we apply parallel COPY in this case?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
