Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features) - Mailing list pgsql-hackers

From Damir Belyalov
Subject Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Date
Msg-id CALH1Lgv5QeYSWYDgMBd=wvKum8egiaMR-_77VGsiHPt7E8dK9Q@mail.gmail.com
Whole thread Raw
In response to Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)  (torikoshia <torikoshia@oss.nttdata.com>)
Responses Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
List pgsql-hackers
> +               /* Buffer was filled, commit subtransaction and prepare to replay */
> +               ReleaseCurrentSubTransaction();
What is actually being committed by this ReleaseCurrentSubTransaction()?
It seems to me that just safecstate->replay_buffer is fulfilled before
this commit.
All tuples are collected in replay_buffer, which is created in ''oldcontext'' of CopyFrom() (not context of a subtransaction). That's why it didn't clean up when you used RollbackAndReleaseCurrentSubTransaction().
Subtransactions are needed for better safety. There is no error when copying from a file to the replay_buffer, but an error may occur at the next stage - when adding a tuple to the table. Also there may be other errors that are not obvious at first glance.

I feel it might be better to have it as a parameter rather than fixed at
1000.
Yes, I thought about it too. So I created another version with the GUC parameter - replay_buffer_size. Attached it. But I think users won't need to change replay_buffer_size.
Also replay_buffer does the same thing as MultiInsert buffer does and MultiInsert buffer defined by const = 1000.

NextCopyFrom() is in copyfromparse.c while safeNextCopyFrom() is in
copyfrom.c.
Since safeNextCopyFrom() is analog of NextCopyFrom() as commented, would
it be natural to put them in the same file?
Sure, corrected it. 
 
> 188 +               safecstate->errors++;
> 189 +               if (safecstate->errors <= 100)
> 190 +                   ereport(WARNING,
> 191 +                           (errcode(errdata->sqlerrcode),
> 192 +                           errmsg("%s", errdata->context)));

It would be better to state in the manual that errors exceeding 100 are
not displayed.
Or, it might be acceptable to output all errors that exceed 100.
It'll be too complicated to create a new parameter just for showing the given number of errors. I think 100 is an optimal size.
 
> +typedef struct SafeCopyFromState
> +{
> +#define            REPLAY_BUFFER_SIZE 1000
> +   HeapTuple       replay_buffer[REPLAY_BUFFER_SIZE]; /* accumulates
> tuples for replaying it after an error */
> +   int             saved_tuples;               /* # of tuples in
> replay_buffer */
> +   int             replayed_tuples;            /* # of tuples was
> replayed from buffer */
> +   int             errors;                     /* total # of errors */
> +   bool            replay_is_active;
> +   bool            begin_subtransaction;
> +   bool            processed_remaining_tuples; /* for case of
> replaying last tuples */
> +   bool            skip_row;

It would be helpful to add comments about skip_row, etc.
Corrected it.
 
WARNING:  FIND 0 ERRORS
When there were no errors, this WARNING seems not necessary.
Corrected it. 

Add to this patch processing other errors and constraints and tests for them.
I had to create another function safeExecConstraints() only for processing constraints, because ExecConstraints() is after NextCopyFrom() and is not in PG_TRY. This thing a little bit complicated the code.
Maybe it is a good approach to create a new function SafeCopyFrom() and do all ''safe copying'' in PG_TRY, but it will almost duplicate the CopyFrom() code.
Or maybe create a function only for loop for(;;). But we have the same thing with duplicating code and passing a lot of variables (which are created at the beginning of CopyFrom()) to this function.

pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Next
From: Nathan Bossart
Date:
Subject: Re: [PATCH] Optimize json_lex_string by batching character copying