Thread: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling

Hi pgsql-hackers,

I'm planning to apply to GSOC'17, and my proposal currently consists of two parts:

(1) Add errors handling to COPY as a minimum program

Motivation: Using PG on a daily basis for years, I have found that there are cases when you need to load (e.g. for further analytics) a bunch of not entirely consistent records with occasional type/column mismatches. Since PG throws an exception on the first error, currently the only solution is to preformat your data with some other tool and then load it into PG. However, it is frequently easier to drop the few bad records than to do such preprocessing for every data source you have.

I have done a bit of research and found the relevant item in PG's TODO https://wiki.postgresql.org/wiki/Todo#COPY, as well as a previous attempt to push a similar patch https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a531@mail.gmail.com. There were no negative responses to that patch; it seems it was simply forgotten and never finalized.

As an example of the general idea, consider the read_csv method of the Python package pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). It uses a C parser which throws an error on the first column mismatch. However, it has two flags, error_bad_lines and warn_bad_lines, which, when set to False, let it drop bad lines or even suppress the warning messages about them.
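
For illustration, a minimal snippet (the sample data is made up; the second data row has an extra column):

import io
import pandas as pd

# Three columns expected; the second data row has an extra field.
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n7,8,9\n")

# With the defaults the C parser raises an error on the malformed line;
# with error_bad_lines=False the line is simply dropped, and warn_bad_lines
# controls whether a warning about it is printed.
df = pd.read_csv(data, error_bad_lines=False, warn_bad_lines=True)
print(df)  # only the structurally valid rows are returned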


(2) Parallel COPY execution as a maximum program

I guess there is not much to say about motivation: it simply should be faster on multicore CPUs.

There is also an entry about parallel COPY in PG's wiki https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better to have well-performing core functionality out of the box.


My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?
3) Is it realistic to mostly finish both parts during 3+ months of almost full-time work, or am I being too presumptuous?

I would greatly appreciate any comments and criticism.


P.S. I know about the very interesting ready-made projects from the PG community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to solve your own problems, issues and questions, which are the product of your own experience with software. That's why I dare to propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from Moscow, Russia, and have been heavily involved in software development since 2010. I believe I have good skills in Python, Ruby, JavaScript, MATLAB, C and Fortran development, and a basic understanding of algorithm design and analysis.


Best regards,

Alexey
Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.aleksey@gmail.com>:
My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?

I spent a lot of time on an implementation of @1 - maybe I can find the patch somewhere. Both tasks have something in common - you have to divide the import into batches.

 
3) Is it realistic to mostly finish both parts during 3+ months of almost full-time work, or am I being too presumptuous?

I think it is possible - I am not sure about all the details, but a basic implementation can be done in 3 months.
 






1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?

I spent a lot of time on an implementation of @1 - maybe I can find the patch somewhere. Both tasks have something in common - you have to divide the import into batches.

The patch is in /dev/null :( - my implementation was based on subtransactions of 1000 rows each. When some check failed, I threw away the subtransaction and imported every row from that block in its own subtransaction. It was a prototype - I didn't look for a smarter implementation.

 
3) Is it realistic to mostly finish both parts during 3+ months of almost full-time work, or am I being too presumptuous?

I think it is possible - I am not sure about all the details, but a basic implementation can be done in 3 months.

Some data and some checks depend on order - that can be a problem for parallel processing - you should define the corner cases.
 
 



On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

> (1) Add errors handling to COPY as a minimum program

Huge +1 if you can do it in an efficient way.

I think the main barrier to doing so is that the naïve approach
creates a subtransaction for every row, which is pretty dire in
performance terms and burns transaction IDs very rapidly.

Most of our datatype I/O functions, etc, have no facility for being
invoked in a mode where they fail nicely and clean up after
themselves. We rely on unwinding the subtransaction's memory context
for error handling, for releasing any LWLocks that were taken, etc.
There's no try_timestamptz_in function or anything, just
timestamptz_in, and it ERROR's if it doesn't like its input. You
cannot safely PG_TRY / PG_CATCH such an exception and continue
processing to, say, write another row.

Currently we also don't have a way to differentiate between

* "this row is structurally invalid" (wrong number of columns, etc)
* "this row is structually valid but has fields we could not parse
into their data types"
* "this row looks structurally valid and has data types we could
parse, but does not satisfy a constraint on the destination table"

Nor do we have a way to write to any kind of failure-log table in the
database, since a simple approach relies on aborting subtransactions
to clean up failed inserts so it can't write anything for failed rows.
Not without starting a 2nd subxact to record the failure, anyway.

So, having said why it's hard, I don't really have much for you in
terms of suggestions for ways forward. User-defined data types,
user-defined constraints and triggers, etc mean anything involving
significant interface changes will be a hard sell, especially in
something pretty performance-sensitive like COPY.

I guess it'd be worth setting out your goals first. Do you want to
handle all the kinds of problems above? Malformed  rows, rows with
malformed field values, and rows that fail to satisfy a constraint? or
just some subset?



--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



> On 23 Mar 2017, at 15:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
>
>> (1) Add errors handling to COPY as a minimum program
>
> Huge +1 if you can do it in an efficient way.
>
> I think the main barrier to doing so is that the naïve approach
> creates a subtransaction for every row, which is pretty dire in
> performance terms and burns transaction IDs very rapidly.
>
> Most of our datatype I/O functions, etc, have no facility for being
> invoked in a mode where they fail nicely and clean up after
> themselves. We rely on unwinding the subtransaction's memory context
> for error handling, for releasing any LWLocks that were taken, etc.
> There's no try_timestamptz_in function or anything, just
> timestamptz_in, and it ERROR's if it doesn't like its input. You
> cannot safely PG_TRY / PG_CATCH such an exception and continue
> processing to, say, write another row.
>
> Currently we also don't have a way to differentiate between
>
> * "this row is structurally invalid" (wrong number of columns, etc)
> * "this row is structually valid but has fields we could not parse
> into their data types"
> * "this row looks structurally valid and has data types we could
> parse, but does not satisfy a constraint on the destination table"
>
> Nor do we have a way to write to any kind of failure-log table in the
> database, since a simple approach relies on aborting subtransactions
> to clean up failed inserts so it can't write anything for failed rows.
> Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows,
then a 2nd subtransaction seems like not a bad idea. We can try to
apply a batch in a subtx; if it fails on some row N, then insert rows [1, N)
in the next subtx, report an error, and commit that subtx. Row N+1 can be
treated as the beginning of the next batch.
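
Roughly, in client-side terms, a Python/psycopg2 sketch of that loop, using savepoints in place of the internal subtransactions (the target table and columns are made up for illustration):

import psycopg2

def copy_with_error_skip(conn, rows, batch_size=1000):
    """Apply rows in savepoint-wrapped batches; on a failure at row N,
    replay the known-good prefix [1, N) in a fresh savepoint, report
    the bad row, and continue with N+1 as the start of the next batch."""
    cur = conn.cursor()
    i = 0
    while i < len(rows):
        batch = rows[i:i + batch_size]
        bad = None
        cur.execute("SAVEPOINT copy_batch")
        try:
            for n, row in enumerate(batch):
                cur.execute("INSERT INTO target (a, b) VALUES (%s, %s)", row)
            cur.execute("RELEASE SAVEPOINT copy_batch")
        except psycopg2.Error as err:
            bad = n
            cur.execute("ROLLBACK TO SAVEPOINT copy_batch")
            # rows before the failing one are known good: replay them
            cur.execute("SAVEPOINT copy_replay")
            for row in batch[:bad]:
                cur.execute("INSERT INTO target (a, b) VALUES (%s, %s)", row)
            cur.execute("RELEASE SAVEPOINT copy_replay")
            print("skipped line %d: %s" % (i + bad + 1, err))
        i += len(batch) if bad is None else bad + 1
    conn.commit()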


But if there are problems with handling everything via
subtransactions, and since parallelism is mentioned anyway, what about
starting a bgworker that will perform the data insertion and will be
controlled by the backend?

For example, the backend can do the following:

* Start a bgworker (or even a parallel worker)
* Get a chunk of rows out of the file and try to apply them as a batch
in a subtransaction in the bgworker.
* If that fails, we can open a transaction in the backend itself and
raise a notice / move the failed rows to a special errors table.

> So, having said why it's hard, I don't really have much for you in
> terms of suggestions for ways forward. User-defined data types,
> user-defined constraints and triggers, etc mean anything involving
> significant interface changes will be a hard sell, especially in
> something pretty performance-sensitive like COPY.
>
> I guess it'd be worth setting out your goals first. Do you want to
> handle all the kinds of problems above? Malformed  rows, rows with
> malformed field values, and rows that fail to satisfy a constraint? or
> just some subset?
>
>
>
> --
> Craig Ringer                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>




Re: GSOC'17 project introduction: Parallel COPY execution with errors handling

From: Alexey Kondratov
Pavel, Craig and Stas,

Thank you for your responses and valuable comments!


It seems that COPY is currently able to return the first error line and the error type (extra or missing columns, type parse error, etc).
Thus, an approach similar to what Stas wrote should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.


Alexey



Hi, Alexey!

On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
Thank you for your responses and valuable comments!


It seems that COPY is currently able to return the first error line and the error type (extra or missing columns, type parse error, etc).
Thus, an approach similar to what Stas wrote should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.

I will be glad to receive any critical remarks and suggestions.

I have the following questions about your proposal.

1. Suppose we have to insert N records
2. We create subtransaction with these N records
3. Error is raised on k-th line
4. Then, we can safely insert all lines from 1st and till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till N with another subtransaction
7. Repeat until the end of file

Do you assume that we start a new subtransaction in step 4, since the subtransaction we started in step 2 is rolled back?

I am planning to use background worker processes for parallel COPY execution. Each process will receive an equal piece of the input file. Since the file is split by size, not by lines, each worker will start importing from the first newline so as not to hit a broken line.

I think that the situation where the backend is directly reading a file during COPY is not typical.  The more typical case is the \copy psql command.  In that case "COPY ... FROM stdin;" is actually executed while psql is streaming the data.
How can we apply parallel COPY in this case?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Hi Alexander!

I missed your reply, since the proposal submission deadline passed last Monday and I haven't been checking the hackers mailing list too frequently.

(1) It seems that starting a new subtransaction at step 4 is not necessary. We can just gather all error lines in one pass and, at the end of the input, start a single additional subtransaction with all the safe lines at once: [1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.

That said, I assume the only viable use case is when the number of errors is relatively small compared to the total number of rows, because if the input is in a totally inconsistent format, it seems useless to import it into the db at all. Thus, it is not 100% clear to me whether there would be any real difference in performance between starting a new subtransaction at step 4 or not.

(2) Hmm, good question. As far as I know it is impossible to get the size of the stdin input, thus it is impossible to distribute stdin directly to the parallel workers. The first approach that comes to mind is to store the stdin input in some kind of buffer/queue and then have the workers read it in parallel. The question is how this would perform for a large file - poorly, I guess, at least from the memory consumption perspective. And whether parallel execution would still be faster is the next question.


Alexey




On Mon, Apr 10, 2017 at 11:39 AM, Alex K <kondratov.aleksey@gmail.com> wrote:
> (1) It seems that starting a new subtransaction at step 4 is not necessary. We
> can just gather all error lines in one pass and, at the end of the input, start
> a single additional subtransaction with all the safe lines at once: [1, ...,
> k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.

The only way to recover from an error is to abort the subtransaction,
or to abort the toplevel transaction.  Anything else is unsafe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Yes, sure, I don't doubt it. The question was around step 4 in the following possible algorithm:

1. Suppose we have to insert N records
2. Start subtransaction with these N records
3. Error is raised on k-th line
4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
7. Repeat until the end of file

One can start subtransaction with those (k - 1) safe-lines and repeat it after each error line
OR
iterate till the end of file and start only one subtransaction with all lines excepting error lines.


Alexey






On Mon, Apr 10, 2017 at 2:46 PM, Alexey Kondratov
<kondratov.aleksey@gmail.com> wrote:
> Yes, sure, I don't doubt it. The question was around step 4 in the following possible algorithm:
>
> 1. Suppose we have to insert N records
> 2. Start subtransaction with these N records
> 3. Error is raised on k-th line
> 4. Then, we know that we can safely insert all lines from the 1st till (k - 1)
> 5. Report, save to errors table or silently drop k-th line
> 6. Next, try to insert lines from (k + 1) till Nth with another subtransaction
> 7. Repeat until the end of file
>
> One can start subtransaction with those (k - 1) safe-lines and repeat it after each error line

I don't understand what you mean by that.

> OR
> iterate till the end of file and start only one subtransaction with all lines excepting error lines.

That could involve buffering a huge file.  Imagine a 300GB load.

Also consider how many XIDs whatever design is proposed will blow
through when loading 300GB of data.  There's a nasty trade-off here
between XID consumption (and the aggressive vacuums it eventually
causes) and preserving performance in the face of errors - e.g. if you
make k = 100,000 you consume 100x fewer XIDs than if you make k =
1000, but you also have 100x the work to redo (on average) every time
you hit an error.  If the data quality is poor (say, 50% of lines have
errors) it's almost impossible to avoid runaway XID consumption.
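
Just to put rough numbers on it, a toy model (assuming one subxact per attempted batch plus one replay subxact per error, and half a batch of redone work per error on average; the constants are purely illustrative):

def copy_cost(total_rows, batch_size, error_rate):
    # One subxact per batch, plus one extra replay subxact per error;
    # each error also redoes roughly half a batch of already-done work.
    errors = total_rows * error_rate
    xids = total_rows / batch_size + errors
    redone_rows = errors * batch_size / 2
    return int(xids), int(redone_rows)

for k in (1000, 100000):
    print(k, copy_cost(total_rows=100000000, batch_size=k, error_rate=1e-5))
# batch_size=1000   -> ~101,000 XIDs, ~500,000 rows redone
# batch_size=100000 -> ~2,000 XIDs,   ~50,000,000 rows redone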

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



2017-04-11 Robert Haas <robertmhaas@gmail.com>:

> There's a nasty trade-off here between XID consumption (and the
> aggressive vacuums it eventually causes) and preserving performance in
> the face of errors - e.g. if you make k = 100,000 you consume 100x
> fewer XIDs than if you make k = 1000, but you also have 100x the work
> to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction even when not
encountering any error after N lines (N starts out at 1), then double
N and continue. When encountering an error, roll the current
subtransaction back and re-insert all the known good rows that have
been rolled back (plus maybe the erroneous row into a separate table
or whatever) in one new subtransaction and commit; then reset N to 1
and continue processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows
is not extremely high: whether the erroneous rows are all clumped
together, entirely randomly spread out over the file, or a combination
of both.
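
In rough Python pseudo-code, the loop I have in mind (insert_batch is assumed to wrap one subtransaction and return the index of the failing row, or None when the whole batch goes in cleanly; record_bad_row is whatever error handling is chosen):

def adaptive_copy(rows, insert_batch, record_bad_row):
    """Doubling scheme sketched above: N starts at 1, doubles after every
    clean batch, and resets to 1 after an error."""
    n, i = 1, 0
    while i < len(rows):
        batch = rows[i:i + n]
        bad = insert_batch(batch)          # one subtransaction per call
        if bad is None:
            i += len(batch)
            n *= 2                         # reward a clean run
        else:
            insert_batch(batch[:bad])      # replay the known-good prefix
            record_bad_row(batch[bad])     # e.g. log it or copy to a side table
            i += bad + 1
            n = 1                          # back off after an error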

> If the data quality is poor (say, 50% of lines have errors) it's
> almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the
proposed. So the docs might need to suggest not to insert a 300 GB
file with 50% erroneous lines :-).

Greetings,

Nicolas



On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
> 2017-04-11 Robert Haas <robertmhaas@gmail.com>:
>> There's a nasty trade-off here between XID consumption (and the
>> aggressive vacuums it eventually causes) and preserving performance in
>> the face of errors - e.g. if you make k = 100,000 you consume 100x
>> fewer XIDs than if you make k = 1000, but you also have 100x the work
>> to redo (on average) every time you hit an error.
>
> You could make it dynamic: Commit the subtransaction even when not
> encountering any error after N lines (N starts out at 1), then double
> N and continue. When encountering an error, roll back the current
> subtransaction back and re-insert all the known good rows that have
> been rolled back (plus maybe the erroneous row into a separate table
> or whatever) in one new subtransaction and commit; then reset N to 1
> and continue processing the rest of the file.
>
> That would work reasonably well whenever the ratio of erroneous rows
> is not extremely high: whether the erroneous rows are all clumped
> together, entirely randomly spread out over the file, or a combination
> of both.

Right.  I wouldn't suggest the exact algorithm you proposed; I think
you ought to vary between some lower limit >1, maybe 10, and some
upper limit, maybe 1,000,000, ratcheting up and down based on how
often you hit errors in some way that might not be as simple as
doubling.  But something along those lines.

>> If the data quality is poor (say, 50% of lines have errors) it's
>> almost impossible to avoid runaway XID consumption.
>
> Yup, that seems difficult to work around with anything similar to the
> proposed. So the docs might need to suggest not to insert a 300 GB
> file with 50% erroneous lines :-).

Yep.  But it does seem reasonably likely that someone might shoot
themselves in the foot anyway.  Maybe we just live with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



> On 12 Apr 2017, at 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
> <nicolas.barbier@gmail.com> wrote:
>> 2017-04-11 Robert Haas <robertmhaas@gmail.com>:
>>> If the data quality is poor (say, 50% of lines have errors) it's
>>> almost impossible to avoid runaway XID consumption.
>>
>> Yup, that seems difficult to work around with anything similar to the
>> proposed. So the docs might need to suggest not to insert a 300 GB
>> file with 50% erroneous lines :-).
>
> Yep.  But it does seem reasonably likely that someone might shoot
> themselves in the foot anyway.  Maybe we just live with that.
>

Moreover, if that file consists of one-byte lines (plus one byte for the newline char),
then during its import an xid wraparound will happen 18 times =)

I think it’s reasonable at least to have something like max_errors parameter
to COPY, that will be set by default to 1000 for example. If user will hit that
limit then it is a good moment to put a warning about possible xid consumption
in case of bigger limit.

However, I think it is worth a quick bit of research whether it is possible to create a special
code path for COPY in which errors don't cancel the transaction. At least when
COPY is called outside of a transaction block.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





On 13 April 2017 at 01:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

> However, I think it is worth a quick bit of research whether it is possible to create a special
> code path for COPY in which errors don't cancel the transaction.

Not really. Anything at any layer of the system expects to be able to ERROR:

* datatype input functions
* CHECK constraints
* FK constraints
* unique indexes
* user defined functions run by triggers
* interrupt signalling (think deadlock detector)
* ...

and we rely on ERROR unwinding any relevant memory contexts, releasing
lwlocks, etc.

When an xact aborts it may leave all sorts of mess on disk. Nothing
gets deleted, it's just ignored due to an aborted xmin.

Maybe some xid burn could be saved by trying harder to pre-validate
batches of data as much as possible before we write anything to the
heap, sorting obviously faulty data into buffers and doing as much
work as possible before allocating a new (sub)xid and writing to the
heap. We'd still abort but we'd only be aborting a vtxid.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Hi pgsql-hackers,

Thank you again for all these replies. I have started working on this project
and have learnt a lot of new stuff over the last month, so here are some new thoughts about
ERRORS handling in COPY. I decided to stick to the same thread, since it
has a neutral subject.

(1) One of my mentors--Alvaro Herrera--suggested that I have a look at
UPSERT. It may be good to be able to achieve the same functionality
as ON CONFLICT DO NOTHING when COPY actually inserts tuples
and errors handling is turned on. It could additionally reduce the number of failed
subtransactions and reduce XID consumption, while still ignoring some common
errors like unique index violations.

Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project's scope, but maybe there is
a relatively simple way to somehow perform the internal tuple inserts with
ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor to UPSERT in PostgreSQL. It would be great
if he could answer this question.

(2) Otherwise, I am still going to use subtransactions via BeginInternalSubTransaction
and PG_TRY / PG_CATCH with
ReleaseCurrentSubTransaction / RollbackAndReleaseCurrentSubTransaction.
To minimize XID consumption I will try to insert tuples in batches and pre-validate
them as much as possible (as was suggested earlier in the thread).



Alexey



On Wed, Jun 7, 2017 at 12:34 PM, Alex K <kondratov.aleksey@gmail.com> wrote:
> (1) One of my mentors--Alvaro Herrera--suggested that I have a look at
> UPSERT.

> It may be good to be able to achieve the same functionality
> as ON CONFLICT DO NOTHING when COPY actually inserts tuples
> and errors handling is turned on. It could additionally reduce the number
> of failed subtransactions and reduce XID consumption, while still ignoring
> some common errors like unique index violations.

Alvaro and I talked about this informally at PGCon.

> Adding full support of ON CONFLICT DO NOTHING/UPDATE to COPY seems
> to be a large separate task and is out of the current project's scope,
> but maybe there is a relatively simple way to somehow perform the internal
> tuple inserts with ON CONFLICT DO NOTHING? I have added Peter Geoghegan
> to cc, as I understand he is the major contributor to UPSERT in PostgreSQL.
> It would be great if he could answer this question.

I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.

You seem to be talking about doing this as an optimization on top of a
base feature that does the main thing you want (captures all errors
within an implementation level subxact without passing them to the
client). That could make sense, as a way of preventing extreme bloat
for a very bad case where almost all inserts have conflicts. (This
seems quite possible, whereas it seems much less likely that users
would have an input file simply full of ill-formed tuples.)

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow. I'm not sure if it's
possible to avoid using subtransactions all together, but speculative
insertion would help if you find that you can do it without
subtransactions. Using subtransactions is always going to be a bit
ugly, because you'll need to continually reassess whether or not
you're batching insertions together at the right granularity (that is,
that you've weighed the rate of XID consumption against how much work
you lose when a batched transaction has to be "replayed" to include
things that are known to be valid). And, if you care about duplicate
violations, then you can't really be sure that replaying a "known
good" tuple will stay good from one moment to the next.

My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types. Is there really much value in
ignoring errors due to invalid encoding? It's not as if such problems
can be reliably detected today. If you use the wrong encoding, and
ignore some errors that COPY would generally raise, then there is an
excellent chance that you'll still insert some remaining rows with
text that has been incorrectly interpreted as valid in the database
encoding -- some text datums are bound to accidentally appear valid.
There are probably similar issues with other types. It's not clear
what the point is at which the user is no longer helped by ignoring
problems, because we cannot reliably detect *all* problems at the
level of each row.

If you must ignore errors within the input functions of types, then
maybe you can optionally let the user do that by way of a "dry run",
where the entire input file is examined for basic structural soundness
ahead of considering constraints. Any errors are saved then and there,
in a format that can be used to make sure that those entries are
skipped on a later COPY. As a further enhancement, in the future, the
user might then be able to define special transform functions that
correct the errors for those rows only. You kind of need to see all
the faulty rows together to do something like that, so a dry run could
make a lot of sense.
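
Just to illustrate, the purely structural part of such a dry run could be as simple as a client-side pass along these lines (a hypothetical helper; it obviously cannot catch type or constraint problems, which need the server):

import csv

def dry_run(path, expected_cols, delimiter=','):
    """First pass: record line numbers whose column count is wrong,
    so a later COPY can skip (or a transform can fix) exactly those."""
    bad_lines = []
    with open(path, newline='') as f:
        for lineno, fields in enumerate(csv.reader(f, delimiter=delimiter), 1):
            if len(fields) != expected_cols:
                bad_lines.append(lineno)
    return bad_lines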

-- 
Peter Geoghegan



Thank you for your comments, Peter - there are some points that I had not thought about before.

On 9 Jun 2017, at 01:09, Peter Geoghegan <pg@bowt.ie> wrote:


I think that there is a way of making COPY use "speculative
insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
no inference specification. Whether or not this is useful depends on a
lot of things.


I am not going to start with "speculative insertion" right now, but it would be very
useful if you could give me a pointer on where to start. Maybe I will at least try to evaluate
the complexity of the problem.

I think that you need to more formally identify what errors your new
COPY error handling will need to swallow.
...
My advice right now is: see if you can figure out a way of doing what
you want without subtransactions at all, possibly by cutting some
scope. For example, maybe it would be satisfactory to have the
implementation just ignore constraint violations, but still raise
errors for invalid input for types.

Initially I was thinking only about malformed rows, e.g. less or extra columns.
Honestly, I did not know that there are so many levels and ways in which an error
can occur. So currently (and especially after your comments) I prefer to focus
only on the following list of errors:

1) File format issues
a. Less columns than needed
b. Extra columns

2) I have doubts about type mismatches. It is possible to imagine a situation where,
e.g. some integers are exported as int, and some as "int", but I am not sure
that it is a common situation.

3) Some constraint violations, e.g. unique index.

The first appears to be easily achievable without subtransactions. I have created a
proof-of-concept version of COPY, where the errors handling is turned on by default.
Please see the small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec).
It throws warnings instead of errors for malformed lines with less/extra columns
and reports the line number.

The second is probably achievable without subtransactions via PG_TRY/PG_CATCH
around heap_form_tuple, since the tuple is not yet inserted into the heap.

But the third is questionable without subtransactions, since even if we check
constraints once, there may be various before/after triggers which can modify the
tuple so that it no longer satisfies them. The corresponding comment inside copy.c states:
"Note that a BR trigger might modify tuple such that the partition constraint is
no satisfied, so we need to check in that case." Thus, there may be different
situations here, as I understand it. However, it is a point where "speculative insertion"
is able to help.

These three cases should cover most real-life scenarios.

Is there really much value in ignoring errors due to invalid encoding?

Now, I have some doubts about it too. If there is an encoding problem,
it probably affects the whole file, not just a few rows.


Alexey


On Mon, Jun 12, 2017 at 3:52 AM, Alexey Kondratov
<kondratov.aleksey@gmail.com> wrote:
> I am not going to start with "speculative insertion" right now, but it
> would be very useful if you could give me a pointer on where to start.
> Maybe I will at least try to evaluate the complexity of the problem.

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c:

* SpeculativeInsertionLockAcquire()

* HeapTupleHeaderSetSpeculativeToken()

* heap_insert() called with HEAP_INSERT_SPECULATIVE argument

* ExecInsertIndexTuples() with specInsert = true

* heap_finish_speculative()

* heap_abort_speculative()

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

>> My advice right now is: see if you can figure out a way of doing what
>> you want without subtransactions at all, possibly by cutting some
>> scope. For example, maybe it would be satisfactory to have the
>> implementation just ignore constraint violations, but still raise
>> errors for invalid input for types.
>
>
> Initially I was thinking only about malformed rows, e.g. less or extra
> columns.
> Honestly, I did not know that there are so many levels and ways where error
> can occur.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worst
case. But, that's just my opinion, and it's possible that someone else
will disagree. Try to get a second opinion.

Limiting the feature to just skip rows on the basis of a formally
defined constraint failing (not including type input failure, or a
trigger throwing an error, and probably not including foreign key
failures because they're really triggers) might be a good approach.
MySQL's INSERT IGNORE is a bit like that, I think. (It doesn't *just*
ignore duplicate violations, unlike our ON CONFLICT DO NOTHING
feature).

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed". Plus maybe one or two other cases, like using speculative
insertion to back out of unique violation without consuming a subxact.

-- 
Peter Geoghegan




On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:
I am not going to start with "speculative insertion" right now, but it would
be very
useful, if you give me a point, where to start. Maybe I will at least try to
evaluate
the complexity of the problem.

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so 
difficult to do the same inside COPY.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worst
case. 

Yes, and the worst case will probably be quite a frequent one, since it is not possible to do heap_multi_insert if BEFORE/INSTEAD triggers or partitioning exist (according to the current copy.c code). Thus, it will frequently fall back to a single heap_insert; with each insert wrapped in a subtransaction, this will consume XIDs too greedily and seriously affect performance. I like my previous idea less and less.

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed"

Currently, ExecConstraints always throws an error, and I do not think it would be wise of me to modify its behaviour.

I have updated my patch (rebased over the topmost master commit 94da2a6a9a05776953524424a3d8079e54bc5d94). Please, find patch file attached or always up to date version on GitHub https://github.com/ololobus/postgres/pull/1/files

Currently, it catches all major errors in the input data:

1) Rows with less/extra columns cause WARNINGs and are skipped

2) I found that input type format errors are thrown from the InputFunctionCall; and wrapped it up with PG_TRY/CATCH. I am not 100%



Alexey



Sorry for the previous email - I accidentally sent it unfinished.

On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:

Speculative insertion has the following special entry points to
heapam.c and execIndexing.c, currently only called within
nodeModifyTable.c

Offhand, it doesn't seem like it would be that hard to teach another
heap_insert() caller the same tricks.

I went through the nodeModifyTable.c code and it seems not to be so 
difficult to do the same inside COPY.

My sense is that it's going to be hard to sell a committer on any
design that consumes subtransactions in a way that's not fairly
obvious to the user, and doesn't have a pretty easily understood worst
case. 

Yes, and the worst case will probably be quite a frequent one, since it is not
possible to do heap_multi_insert when BEFORE/INSTEAD triggers or partitioning
exist (according to the current copy.c code). Thus, it will frequently fall back
to a single heap_insert; with each insert wrapped in a subtransaction, this will
consume XIDs too greedily and seriously affect performance. I like my initial
idea less and less.

By the way, is it possible to use heap_multi_insert with speculative insertion too?

I haven't thought about this very carefully, but I guess you could do
something like passing a flag to ExecConstraints() that indicates
"don't throw an error; instead, just return false so I know not to
proceed"

Currently, ExecConstraints always throws an error, and I do not think that
it would be wise of me to modify its behaviour.

I have updated my patch (rebased over the topmost master commit
94da2a6a9a05776953524424a3d8079e54bc5d94). Please find the patch
file attached, or the always up-to-date version on GitHub:
https://github.com/ololobus/postgres/pull/1/files

It catches all major errors in the input data:

1) Rows with less/extra columns cause WARNINGs and are skipped

2) I found that input type format errors are thrown from
InputFunctionCall, and wrapped it with PG_TRY/PG_CATCH.

I am not sure that it is 100% transactionally safe, but it seems so,
since all these errors are handled before the point
where the current COPY implementation has a mechanism to skip a tuple.
I use the same skip_tuple flag.

The patch passes all regression tests, except a few that fail due to the slightly
changed error message texts.


Now, I think that it may be a good idea to separate all possible errors 
into two groups:
– Malformed input data
– DB conflicts during insertion

The first is solved (I hope) well by the current patch. I could add, e.g., a
MAXERRORS flag to COPY, which would limit the number of errors.

The second may be solved with speculative insertion, using the same
ON CONFLICT DO syntax as in the INSERT statement.

Following this path, we do not use subtransactions at all, and we keep the
behaviour of INSERT and COPY predictable and consistent across the
database. To me that sounds much better than just swallowing all errors
indiscriminately, without any logic.


Alexey

> On 16 Jun 2017, at 21:30, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

> > On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:


> > Speculative insertion has the following special entry points to
> > heapam.c and execIndexing.c, currently only called within
> > nodeModifyTable.c

> > Offhand, it doesn't seem like it would be that hard to teach another
> > heap_insert() caller the same tricks.


> I went through the nodeModifyTable.c code and it seems not to be so
> difficult to do the same inside COPY.

After a closer look, I have figured out at least one difficulty: COPY
and INSERT follow different execution paths - INSERT goes through
the planner, while COPY does not. This leads to the absence of some required
attributes like arbiterIndexes, which are available during INSERT via
PlanState/ModifyTableState. It is probably possible to get the same in
COPY, but it is not clear to me how.

Anyway, adding 'speculative insertion' to COPY is worth a
separate patch, and I would be glad to try implementing it.

In the meantime I have prepared a complete working patch with:

- ignoring of the input data formatting errors
- IGNORE_ERRORS parameter in the COPY options
- updated regression tests

Please find the patch attached, or check the web UI diff on GitHub as always:
https://github.com/ololobus/postgres/pull/1/files
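
For reference, this is roughly how the new option is meant to be used from a client (a psycopg2 sketch; the table and file names are just examples, and the exact option spelling follows the patch and may still change during review):

import psycopg2

conn = psycopg2.connect("dbname=test")
with conn, conn.cursor() as cur, open("data.csv") as f:
    # IGNORE_ERRORS is the COPY option added by the patch above:
    # malformed lines produce warnings and are skipped instead of
    # aborting the whole command.
    cur.copy_expert(
        "COPY target FROM STDIN WITH (FORMAT csv, IGNORE_ERRORS)", f)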


Alexey


On Wed, Jun 21, 2017 at 11:37 PM, Alex K <kondratov.aleksey@gmail.com> wrote:
>> On 16 Jun 2017, at 21:30, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
>
>> > On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
>
>> > Speculative insertion has the following special entry points to
>> > heapam.c and execIndexing.c, currently only called within
>> > nodeModifyTable.c
>
>> > Offhand, it doesn't seem like it would be that hard to teach another
>> > heap_insert() caller the same tricks.
>
>
>> I went through the nodeModifyTable.c code and it seems not to be so
>> difficult to do the same inside COPY.
>
> After a more precise look, I have figured out at least one difficulty, COPY
> and INSERT follow the different execution paths: INSERT goes through
> the Planner, while COPY does not. It leads to the absence of some required
> attributes like arbiterIndexes, which are available during INSERT via
> PlanState/ModifyTableState. Probably it is possible to get the same in the
> COPY, but it is not clear for me how.
>
> Anyway, adding of the 'speculative insertion' into the COPY is worth of a
> separated patch; and I would be glad to try implementing it.
>
> In the same time I have prepared a complete working patch with:
>
> - ignoring of the input data formatting errors
> - IGNORE_ERRORS parameter in the COPY options
> - updated regression tests
>
> Please, find the patch attached or check the web UI diff on GitHub as always:
> https://github.com/ololobus/postgres/pull/1/files

"git diff master --check" complains heavily, and the patch does not
apply anymore. The last patch is 5-month old as well, so I am marking
the patch as returned with feedback.
-- 
Michael