Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers

From Damir Belyalov
Subject Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Msg-id CALH1Lgsr7Asz6r6jnRta7DVEjKwyz=b3XJhyCMoYv8Yy9y4cXg@mail.gmail.com
In response to Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (torikoshia <torikoshia@oss.nttdata.com>)
List pgsql-hackers
Thank you for reviewing.
In the previous patch there was an error when processing constraints. The patch is fixed, but the code grew and became more complicated (0005-COPY_IGNORE_ERRORS). I also simplified the logic of safeNextCopyFrom().
You asked why we need subtransactions; the answer is in this patch. When processing a row that does not satisfy a constraint or an INSTEAD OF trigger, it is necessary to roll back the subtransaction to return the table to its original state, roughly the pattern sketched below.
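For context, this is roughly the standard backend pattern for running one row inside a subtransaction and discarding its effects on failure (cf. PL/pgSQL exception blocks); it is only an illustration of the mechanism, not the literal patch code:

```
/* Illustrative sketch of per-row subtransaction handling. */
MemoryContext oldcontext = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;

BeginInternalSubTransaction(NULL);
MemoryContextSwitchTo(oldcontext);

PG_TRY();
{
    /* insert the row: constraints and triggers fire here */

    ReleaseCurrentSubTransaction();     /* row is good, keep its effects */
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
}
PG_CATCH();
{
    /* row failed: undo everything it did and move on to the next row */
    RollbackAndReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
    FlushErrorState();
}
PG_END_TRY();
```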
Because of this complexity, I had to abandon constraint and trigger processing and handle only errors that occur while reading the file. I am attaching the simplified patch (0006-COPY_IGNORE_ERRORS).
I checked these patches against all the COPY regression tests, and they work correctly.
 
> BTW in v4 patch, data are loaded into the buffer one by one, and when
> the buffer fills up, the data in the buffer are 'replayed' also one by
> one, right?
> Wouldn't this have more overhead than a normal COPY?
The data is loaded into the buffer one row at a time, but in "replay mode" ignore_errors works like a standard COPY: tuples are added to the slot and, depending on the insert method (CIM_SINGLE or CIM_MULTI), are replayed accordingly.
For the 0006 patch, you can think of the for (;;) loop as divided into two parts: the first part adds tuples to the buffer, and the second part inserts the buffered tuples into the table. The two parts do not intersect and are executed sequentially (see the outline below).
The main point of replay_buffer is the CIM_SINGLE case: if you implement CIM_SINGLE without it, you will see that tuples read before an error occurs are not added to the table. The logic of the 0005 patch is similar, with some differences.
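To make the two phases concrete, here is how I would outline the loop. safeNextCopyFrom() and replay_buffer come from the patch, while REPLAY_BUFFER_SIZE, tupdesc, values/nulls, and insert_one_tuple() are placeholder names, so this is a sketch of the flow rather than the actual code:

```
/* Illustrative outline only; helper names are placeholders. */
for (;;)
{
    int     nbuffered = 0;

    /*
     * Phase 1: read rows, buffering the tuples that parse cleanly and
     * skipping (while counting) the rows that raise an input error.
     */
    while (nbuffered < REPLAY_BUFFER_SIZE &&
           safeNextCopyFrom(cstate, econtext, values, nulls))
        replay_buffer[nbuffered++] = heap_form_tuple(tupdesc, values, nulls);

    if (nbuffered == 0)
        break;                  /* end of input */

    /*
     * Phase 2: replay the buffered tuples exactly as a normal COPY
     * would, via the CIM_SINGLE or CIM_MULTI path.
     */
    for (int i = 0; i < nbuffered; i++)
        insert_one_tuple(replay_buffer[i]);
}
```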

> As a test, I COPYed slightly larger data with and without ignore_errors
> option.
> There might be other reasons, but I found a performance difference.
I tried to reduce the performance difference by cleaning up replay_buffer with a reset of a new dedicated memory context for it, replay_cxt (see the sketch after the numbers below).
```
Before:
Without ignore_errors:
COPY 10000000
Time: 15538,579 ms (00:15,539)
With ignore_errors:
COPY 10000000
Time: 21289,121 ms (00:21,289)

After:
Without ignore_errors:
COPY 10000000
Time: 15318,922 ms (00:15,319)
With ignore_errors:
COPY 10000000
Time: 19868,175 ms (00:19,868)
```
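The idea is simply to give the buffered tuples their own memory context so each batch can be freed in one cheap reset. Where the context is created and reset is illustrative here; only the per-batch reset is the point:

```
/* Illustrative sketch of the replay_cxt idea. */
MemoryContext replay_cxt = AllocSetContextCreate(CurrentMemoryContext,
                                                 "COPY replay context",
                                                 ALLOCSET_DEFAULT_SIZES);

/* tuples in replay_buffer are formed in replay_cxt during phase 1 ... */

/* ... and after each batch is replayed, all of them are freed at once */
MemoryContextReset(replay_cxt);
```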

> - Put in the documentation that the warnings will not be output for more
> than 101 cases.
Yeah, I pointed it out in the doc.
 
> I applied v4 patch and when canceled the COPY, there was a case I found
> myself left in a transaction.
> Should this situation be prevented from occurring?
>
> ```
> =# copy test from '/tmp/10000000.data' with (ignore_errors );
>
> ^CCancel request sent
> ERROR:  canceling statement due to user request
>
> =# truncate test;
> ERROR:  current transaction is aborted, commands ignored until end of
> transaction block
> ```
I tried to reproduce your error and could not. The result was the same as with a plain COPY FROM.
