Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Msg-id CAFj8pRCqyF+G4=gG0vi+hS96zBNeikjoS1w6P8HS2ybN9_nAtA@mail.gmail.com
In response to [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Alexey Kondratov <kondratov.aleksey@gmail.com>)
Responses Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling  (Pavel Stehule <pavel.stehule@gmail.com>)
List pgsql-hackers
Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.aleksey@gmail.com>:
Hi pgsql-hackers,

I'm planning to apply to GSOC'17, and my proposal currently consists of two parts:

(1) Add error handling to COPY as the minimum program

Motivation: Having used PG daily for years, I have found cases where you need to load (e.g. for further analytics) a batch of not entirely consistent records with rare type or column mismatches. Since PG throws an exception on the first error, currently the only solution is to preprocess your data with some other tool and then load it into PG. However, it is frequently easier to drop certain records than to do such preprocessing for every data source you have.

I did some research and found the item in PG's TODO list, https://wiki.postgresql.org/wiki/Todo#COPY, and a previous attempt to push a similar patch, https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a531@mail.gmail.com. There were no negative responses to that patch, and it seems that it was simply forgotten and never finalized.

As an example of the general idea, I can point to the read_csv method of the Python package pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). It uses a C parser which throws an error on the first column mismatch. However, it has two flags, error_bad_lines and warn_bad_lines, which, when set to False, make it drop bad lines or even hide the warning messages about them.
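
To make the "drop bad records instead of aborting" idea concrete, here is a minimal sketch in plain Python. It is not the pandas implementation and not a proposal for how COPY should do it; the function name load_rows and the skip_bad_rows flag are hypothetical, chosen only for illustration. It treats a wrong column count or a failed type conversion as a bad record.

```python
import csv
import io


def load_rows(text, converters, skip_bad_rows=True):
    """Parse CSV text and convert each field with the matching converter.

    Hypothetical helper: with skip_bad_rows=True malformed records are
    collected in a reject list; with False the first one raises.
    """
    good, rejected = [], []
    reader = csv.reader(io.StringIO(text))
    for line_no, fields in enumerate(reader, start=1):
        try:
            if len(fields) != len(converters):
                raise ValueError(
                    f"expected {len(converters)} columns, got {len(fields)}"
                )
            # A failed int()/float() conversion also raises ValueError.
            good.append(tuple(conv(v) for conv, v in zip(converters, fields)))
        except ValueError as exc:
            if not skip_bad_rows:
                raise ValueError(f"line {line_no}: {exc}") from exc
            rejected.append((line_no, str(exc)))
    return good, rejected


sample = "1,alice,3.5\n2,bob\nx,carol,1.0\n4,dave,2.25\n"
rows, errors = load_rows(sample, (int, str, float))
print(rows)                          # records that were accepted
print([line for line, _ in errors])  # line numbers of dropped records
```

Keeping the rejected line numbers, rather than silently discarding the records, lets the caller decide afterwards whether the loss is acceptable.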


(2) Parallel COPY execution as the maximum program

I guess there is little to say about motivation: it should simply be faster on multicore CPUs.

There is also an entry about parallel COPY in PG's wiki, https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some external extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better to have well-performing core functionality out of the box.


My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea, simplifying work with large amounts of data from different sources, but maybe it would be better to focus on a single task?

I spent a lot of time on the implementation of (1) - maybe I can find the patch somewhere. Both tasks have something in common - you have to divide the import into batches.
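
A rough sketch of what dividing the import into batches could look like, written in Python purely for illustration. Everything here is an assumption rather than anything taken from an existing patch: insert_batch is a hypothetical stand-in for an all-or-nothing insert, and bisecting a failed batch is just one possible way to isolate bad rows.

```python
from itertools import islice


def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_batch(batch, insert_batch, rejected):
    """Insert a batch atomically; on failure, bisect it to isolate bad rows.

    Returns the number of rows loaded. Rows that fail on their own are
    appended to `rejected`.
    """
    try:
        insert_batch(batch)
        return len(batch)
    except ValueError:
        if len(batch) == 1:
            rejected.append(batch[0])
            return 0
        mid = len(batch) // 2
        return (load_batch(batch[:mid], insert_batch, rejected)
                + load_batch(batch[mid:], insert_batch, rejected))


table = []


def insert_batch(batch):
    # Hypothetical all-or-nothing insert: convert everything first, so a
    # bad value raises before any row of the batch reaches `table`.
    converted = [int(value) for value in batch]
    table.extend(converted)


rejected = []
data = ["1", "2", "oops", "4", "5", "6", "n/a", "8"]
loaded = sum(load_batch(b, insert_batch, rejected) for b in batches(data, 4))
print(loaded, table, rejected)
```

The same independent batches would also be a natural unit of work to hand to separate workers, which is where the two tasks overlap.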

 
3) Is it realistic to mostly finish both parts during 3+ months of almost full-time work, or am I being too presumptuous?

I think it is possible. I am not sure about all the details, but a basic implementation can be done in 3 months.
 

I would greatly appreciate any comments and criticism.


P.S. I know about the very interesting ready-made projects from the PG community, https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to solve your own problems, issues and questions, which are the product of your experience with the software. That's why I dare to propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from Moscow, Russia, and have been highly involved in software development since 2010. I believe I have good skills in Python, Ruby, JavaScript, MATLAB, C and Fortran development and a basic understanding of algorithm design and analysis.


Best regards,

Alexey
