
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Date	April 6, 2017 19:47:46
Msg-id	CAPpHfdvV8FC67Emeb9XJpULkMOtrJiyC0dGL7FMSyRZ2SLk=5Q@mail.gmail.com
In response to	[HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling (Alexey Kondratov <kondratov.aleksey@gmail.com>)
Responses	Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling (Alex K <kondratov.aleksey@gmail.com>)
List	pgsql-hackers


Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

Thank you for your responses and valuable comments!

I have written a draft proposal: https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit

It seems that COPY is currently able to return the first error line and the error type (extra or missing columns, type parse error, etc.).
Thus, an approach similar to the one Stas described should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

I have the following questions about your proposal.

1. Suppose we have to insert N records
2. We create a subtransaction with these N records
3. An error is raised on the k-th line
4. Then, we can safely insert all lines from the 1st through the (k - 1)-th
5. Report the k-th line, save it to an errors table, or silently drop it
6. Next, try to insert the lines from (k + 1) through N with another subtransaction
7. Repeat until the end of the file

Do you assume that we start a new subtransaction in step 4, since the subtransaction we started in step 2 is rolled back?
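To make the question concrete, here is a minimal in-memory sketch of steps 1-7 under the assumption that the subtransaction from step 2 is rolled back as a whole, so lines 1 through (k - 1) have to be re-inserted in a fresh one. It is not PostgreSQL code: `RowError`, `insert_in_subtransaction` and `copy_skipping_errors` are hypothetical names, and a Python list stands in for the target table.

```python
class RowError(Exception):
    """Hypothetical error carrying the offset of the failing row in its batch."""

    def __init__(self, index, reason):
        super().__init__(reason)
        self.index = index
        self.reason = reason


def insert_in_subtransaction(table, rows, parse):
    """Stand-in for one subtransaction: either every row lands or none does."""
    staged = []
    for i, raw in enumerate(rows):
        try:
            staged.append(parse(raw))
        except ValueError as exc:
            # Leaving without touching `table` models the rollback.
            raise RowError(i, str(exc)) from exc
    table.extend(staged)  # models the subtransaction commit


def copy_skipping_errors(rows, parse):
    """Insert all parseable rows; return (table, [(line_index, reason), ...])."""
    table, errors = [], []
    start = 0
    while start < len(rows):
        try:
            # Steps 2 and 6: try everything that is left in one subtransaction.
            insert_in_subtransaction(table, rows[start:], parse)
            break
        except RowError as err:
            k = start + err.index  # step 3: absolute index of the bad line
            # Step 4: the failed attempt was rolled back, so the known-good
            # prefix is inserted again in a new subtransaction.
            insert_in_subtransaction(table, rows[start:k], parse)
            errors.append((k, err.reason))  # step 5: report the k-th line
            start = k + 1  # steps 6-7: continue after the bad line
    return table, errors
```

With a small number of bad lines the extra work is one re-insert of the good prefix per error, which is where the cost of this approach would come from.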

I am planning to use background worker processes for parallel COPY execution. Each process will receive an equal piece of the input file. Since the file is split by size, not by lines, each worker will start importing from the first new line so that it does not hit a broken line.
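A minimal sketch of that splitting rule, assuming a seekable binary file and plain newline-terminated rows: each nominal boundary is moved forward to the next line start, and a worker owns exactly the lines that begin inside its range. `aligned_start` and `chunk_ranges` are hypothetical helper names, not part of the proposal. The sketch ignores newlines embedded in quoted CSV fields, which would need extra handling.

```python
import io
import os


def aligned_start(f, pos, size):
    """Return the offset of the first line that starts at or after `pos`."""
    if pos <= 0:
        return 0
    if pos >= size:
        return size
    # Read from the byte before `pos` to the end of its line; the file
    # position afterwards is the start of the next line.
    f.seek(pos - 1)
    f.readline()
    return f.tell()


def chunk_ranges(f, n_workers):
    """Split `f` into at most `n_workers` byte ranges aligned to line starts."""
    size = f.seek(0, os.SEEK_END)
    starts = [aligned_start(f, size * i // n_workers, size)
              for i in range(n_workers)]
    starts.append(size)
    # Drop empty ranges, which appear when one long line spans several
    # nominal boundaries.
    return [(a, b) for a, b in zip(starts, starts[1:]) if a < b]


def read_range(f, start, end):
    """Read the bytes of one worker's range."""
    f.seek(start)
    return f.read(end - start)
```

For example, `chunk_ranges(io.BytesIO(data), 4)` returns the ranges that four workers would take, and concatenating `read_range` over those ranges reproduces `data`, so no line is lost or read twice.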

I think the situation where the backend reads the file directly during COPY is not typical. A more typical case is the psql \copy command. In that case "COPY ... FROM stdin;" is actually executed while psql streams the data.

How can we apply parallel COPY in this case?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com

The Russian Postgres Company
