Re: Fast COPY FROM based on batch insert - Mailing list pgsql-hackers

From: Zhihong Yu
Subject: Re: Fast COPY FROM based on batch insert
Msg-id: CALNJ-vQd-bMrydDzUg5+fr9w+gTnErCADhk3KSi72_RCcKJw=Q@mail.gmail.com
In response to: Re: Fast COPY FROM based on batch insert (Etsuro Fujita <etsuro.fujita@gmail.com>)
List: pgsql-hackers
On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Tue, Jul 26, 2022 at 7:19 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> Yeah, I think the patch is getting better, but I noticed some issues,
> so I'm working on them. I think I can post a new version in the next
> few days.
* When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
patch uses the slots passed to ExecForeignBatchInsert(), not the ones
returned by the callback function, but I don't think that is always
correct, as the documentation for the callback function says:
The return value is an array of slots containing the data that was
actually inserted (this might differ from the data supplied, for
example as a result of trigger actions.)
The passed-in <literal>slots</literal> can be re-used for this purpose.
postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
I modified the patch to reference the returned slots when running the
AFTER ROW triggers. I also modified the patch to initialize the
tts_tableOid. Attached is an updated patch, in which I made some
minor adjustments to CopyMultiInsertBufferFlush() as well.
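For reference, the intended shape of the trigger loop is roughly the
following. This is only a sketch against PostgreSQL's executor
internals, not the attached patch itself; it is not compilable on its
own, and the surrounding variables (estate, resultRelInfo, cstate) are
assumed to come from CopyMultiInsertBufferFlush():

```c
/* Sketch: run AFTER ROW INSERT triggers on the slots *returned* by
 * ExecForeignBatchInsert(), not on the ones passed in, since an FDW
 * is allowed to return different slots.
 */
rslots = resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert(estate,
                                                              resultRelInfo,
                                                              slots,
                                                              NULL,
                                                              &nused);

for (i = 0; i < nused; i++)
{
    TupleTableSlot *slot = rslots[i];

    /* Initialize tts_tableOid, as the FDW might not have set it. */
    slot->tts_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);

    ExecARInsertTriggers(estate, resultRelInfo, slot,
                         NIL, cstate->transition_capture);
}
```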
* The patch produces incorrect error context information:
create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw options
(dbname 'postgres');
create user mapping for current_user server loopback;
create table t1 (f1 int, f2 text);
create foreign table ft1 (f1 int, f2 text) server loopback options
(table_name 't1');
alter table t1 add constraint t1_f1positive check (f1 >= 0);
alter foreign table ft1 add constraint ft1_f1positive check (f1 >= 0);
-- single insert
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> -1 foo
>> 1 bar
>> \.
ERROR: new row for relation "t1" violates check constraint "t1_f1positive"
DETAIL: Failing row contains (-1, foo).
CONTEXT: remote SQL command: INSERT INTO public.t1(f1, f2) VALUES ($1, $2)
COPY ft1, line 1: "-1 foo"
-- batch insert
alter server loopback options (add batch_size '2');
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> -1 foo
>> 1 bar
>> \.
ERROR: new row for relation "t1" violates check constraint "t1_f1positive"
DETAIL: Failing row contains (-1, foo).
CONTEXT: remote SQL command: INSERT INTO public.t1(f1, f2) VALUES
($1, $2), ($3, $4)
COPY ft1, line 3
In single-insert mode the error context information is correct, but in
batch-insert mode it isn’t (i.e., the line number isn’t correct).
The error occurs on the remote side, so I'm not sure if there is a
simple fix. What I came up with is to just suppress error context
information other than the relation name, like the attached. What do
you think about that?
(In CopyMultiInsertBufferFlush() your patch sets cstate->cur_lineno to
buffer->linenos[i] even when running AFTER ROW triggers for the i-th
row returned by ExecForeignBatchInsert(), but that wouldn’t always be
correct, as the i-th returned row might not correspond to the i-th row
originally stored in the buffer as the callback function returns only
the rows that were actually inserted on the remote side. I think the
proposed fix would address this issue as well.)
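In sketch form, "suppress everything but the relation name" could look
like the following change to the COPY FROM error-context callback. The
flag name relname_only is hypothetical here, chosen only to illustrate
the idea; the attached patch may implement this differently:

```c
/* Sketch: report only the relation name when rows are sent to the
 * remote server in batches, since the line number tracked locally can
 * be wrong there.  "relname_only" would be set for foreign-table
 * batching.
 */
static void
CopyFromErrorCallback(void *arg)
{
    CopyFromState cstate = (CopyFromState) arg;

    if (cstate->relname_only)
    {
        errcontext("COPY %s", cstate->cur_relname);
        return;
    }

    errcontext("COPY %s, line %llu",
               cstate->cur_relname,
               (unsigned long long) cstate->cur_lineno);
}
```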
* The patch produces an incorrect row count when some or all of the
rows passed to ExecForeignBatchInsert() weren't inserted on the remote
side:
create function trig_null() returns trigger as $$ begin return NULL;
end $$ language plpgsql;
create trigger trig_null before insert on t1 for each row execute
function trig_null();
-- single insert
alter server loopback options (drop batch_size);
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 0 foo
>> 1 bar
>> \.
COPY 0
-- batch insert
alter server loopback options (add batch_size '2');
copy ft1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 0 foo
>> 1 bar
>> \.
COPY 2
The row count is correct in single-insert mode, but isn’t in batch-insert mode.
The reason is that in batch-insert mode the row counter is updated
immediately after adding the row to the buffer, not after doing
ExecForeignBatchInsert(), which might ignore the row. To fix, I
modified the patch to delay updating the row counter (and the progress
of the COPY command) until after doing the callback function. For
consistency, I also modified the patch to delay it even when batching
into plain tables; IMO that is more consistent with the single-insert
mode, in which we update them after writing the tuple out to the table
or sending it to the remote side.
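In sketch form, the fix moves the accounting from buffer-add time to
flush time. This is illustrative only (the variable names and the exact
call sites are assumptions, loosely following copyfrom.c), not the
patch itself:

```c
/* Before: the counter was bumped when the row was merely added to the
 * multi-insert buffer -- too early, since ExecForeignBatchInsert() may
 * skip some rows.
 *
 *     CopyMultiInsertInfoStore(...);
 *     processed++;
 *
 * After: count only what was actually inserted, inside
 * CopyMultiInsertBufferFlush(), once ExecForeignBatchInsert() (or
 * table_multi_insert()) has reported how many rows went in.
 */
*processed += inserted;
pgstat_progress_update_param(PROGRESS_COPY_TUPLES_PROCESSED, *processed);
```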
* I modified the patch so that when batching into foreign tables we
skip useless steps in CopyMultiInsertBufferInit() and
CopyMultiInsertBufferCleanup().
That’s all I have for now. Sorry for the delay.
Best regards,
Etsuro Fujita
Hi,
+   /* If any rows were inserted, run AFTER ROW INSERT triggers. */
...
+   for (i = 0; i < inserted; i++)
+   {
+       TupleTableSlot *slot = rslots[i];
...
+       slot->tts_tableOid =
+           RelationGetRelid(resultRelInfo->ri_RelationDesc);
It seems the return value of `RelationGetRelid(resultRelInfo->ri_RelationDesc)` could be computed once, outside the for loop, and stored in a variable.
Inside the for loop, assign that variable to `slot->tts_tableOid`.
Cheers