Re: POC: postgres_fdw insert batching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: POC: postgres_fdw insert batching
Msg-id 0ab26876-9e89-ef24-094d-35b991672309@enterprisedb.com
In response to RE: POC: postgres_fdw insert batching  ("tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com>)
Responses RE: POC: postgres_fdw insert batching
List pgsql-hackers

On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> 1) We're calling it "batch_size" but the API function is named
>> postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
>> to postgresGetModifyBatchSize()? That has the advantage it'd work if we
>> ever add support for batching to UPDATE/DELETE.
> 
> Actually, I was in two minds whether the term batch or bulk is better, because Oracle uses "bulk insert" and "bulk
> fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch as in
> "batch updates" and its API method names (addBatch, executeBatch).
> 
> But it seems better or common to use batch according to the etymology and the following Stack Overflow page:
> 
> https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch
> 
> OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better.  That is, this API
> deals with multiple records in a single INSERT statement.  Your GetModifyBatchSize will be reserved for statement
> batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the
> following JDBC batch updates.  What do you think?
> 

I don't know. I was really only thinking about batching in the context
of a single DML command, not about batching of multiple commands at the
protocol level. IMHO it's far more likely we'll add support for batching
for DELETE/UPDATE than libpq pipelining, which seems rather different
from how the FDW API works. Which is why I was suggesting to use a name
that would work for all DML commands, not just for inserts.
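
To be a bit more concrete, what I have in mind is roughly this (just a
sketch; the exact signature is an assumption on my part, following the
usual FdwRoutine callback style):

--------------------------------------------------
/* sketch only - hypothetical callback typedef for FdwRoutine */
typedef int (*GetModifyBatchSize_function) (ResultRelInfo *rinfo);

/*
 * Returns the max number of rows the FDW wants batched into a single
 * remote DML statement (1 means no batching).  The name deliberately
 * isn't INSERT-specific, so the same callback could serve UPDATE and
 * DELETE if we ever add batching for those.
 */
--------------------------------------------------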

> CODE EXAMPLE 14-1 Creating and executing a batch of insert statements 
> --------------------------------------------------
> Statement stmt = con.createStatement(); 
> stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')"); 
> stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')"); 
> stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)"); 
> 
> // submit a batch of update commands for execution 
> int[] updateCounts = stmt.executeBatch(); 
> --------------------------------------------------
> 

Sure. We already have a patch to support something like this at the
libpq level, IIRC. But I'm not sure how well that matches the FDW API
approach in general.
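
For reference, IIRC the libpq API from that patch looks roughly like
this (sketch from memory, so the names/signatures may not match the
patch exactly, and the result handling is simplified):

--------------------------------------------------
/* conn is assumed to be an established PGconn */
const char *values1[] = { "1" };
const char *values2[] = { "2" };
PGresult   *res;

PQenterPipelineMode(conn);
PQsendQueryParams(conn, "INSERT INTO t VALUES ($1)",
                  1, NULL, values1, NULL, NULL, 0);
PQsendQueryParams(conn, "INSERT INTO t VALUES ($1)",
                  1, NULL, values2, NULL, NULL, 0);
PQpipelineSync(conn);    /* flush the queue, mark a sync point */

/* one result set per queued statement, each terminated by NULL */
for (int i = 0; i < 2; i++)
{
    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
}
res = PQgetResult(conn); /* the sync point itself */
PQclear(res);

PQexitPipelineMode(conn);
--------------------------------------------------

So it batches whole statements at the protocol level, which is quite a
different shape from a per-relation batch size in the FDW API.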

> 
>> 2) Do we have to lookup the batch_size in create_foreign_modify (in
>> server/table options)? I'd have expected to look it up while planning
>> the modify and then pass it through the list, just like the other
>> FdwModifyPrivateIndex stuff. But maybe that's not possible.
> 
> Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning.  Unfortunately, it's also
> called from BeginForeignInsert(), but other stuff passed to create_foreign_modify(), including the query string, is
> constructed there.
> 

Hmm, ok.
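
Just so we're on the same page, the lookup I was thinking about is
something like this (a sketch only; I'm assuming a table-level
batch_size overrides the server-level one, and rel is the target
Relation as available in create_foreign_modify()):

--------------------------------------------------
/* sketch: fetch batch_size from server/table options, table wins */
ForeignTable  *table = GetForeignTable(RelationGetRelid(rel));
ForeignServer *server = GetForeignServer(table->serverid);
List          *options = list_concat(list_copy(server->options),
                                     table->options);
ListCell      *lc;
int            batch_size = 1;      /* default: no batching */

foreach(lc, options)
{
    DefElem    *def = (DefElem *) lfirst(lc);

    if (strcmp(def->defname, "batch_size") == 0)
        (void) parse_int(defGetString(def), &batch_size, 0, NULL);
}
--------------------------------------------------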

> 
>> 3) That reminds me - should we show the batching info on EXPLAIN? That
>> seems like a fairly interesting thing to show to the user. Perhaps
>> showing the average batch size would also be useful? Or maybe not, we
>> create the batches as large as possible, with the last one smaller.
> 
> Hmm, maybe batch_size is not for EXPLAIN, because its value doesn't change dynamically based on the planning or
> system state, unlike shared buffers and parallel workers.  OTOH, I sometimes want to see what configuration parameter
> values the user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and
> auto_explain).  For example, it'd be nice if EXPLAIN (parameters on) could do that.  Some relevant FDW-related
> parameters could be included in that output.
> 

Not sure, but I'd guess knowing whether batching is used would be
useful. We only print the single-row SQL query, which kinda gives the
impression that there's no batching.
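
Adding it to the VERBOSE output would be cheap, something like this in
postgresExplainForeignModify (hypothetical, not in the patch; sql and
batch_size stand for whatever we stash in the fdw_private state):

--------------------------------------------------
/* hypothetical addition to postgresExplainForeignModify() */
if (es->verbose)
{
    ExplainPropertyText("Remote SQL", sql, es);

    if (batch_size > 1)     /* only show it when batching applies */
        ExplainPropertyInteger("Batch Size", NULL, batch_size, es);
}
--------------------------------------------------

That way plain EXPLAIN stays unchanged, and VERBOSE tells you whether
batching kicks in.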

>> 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
>> over for every tuple. I don't know if that has measurable impact, but it
>> seems a bit excessive IMO. I don't think we should support the batch
>> size changing during execution (seems tricky).
> 
> Don't worry about this, too.  GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in
create_foreign_modify().
> 

Well, I do worry for two reasons.

Firstly, the fact that in postgres_fdw the call is cheap does not mean
it'll be like that in every other FDW. Presumably, the other FDWs might
cache it in the struct and do the same thing, of course.

But the fact that we're calling it over and over for each row kinda
seems like we allow the value to change during execution, but I very
much doubt the code is expecting that. I haven't tried, but assume the
function first returns 10 and then 100. ISTM the code will allocate
ri_Slots with 25 slots, but then we'll try stashing 100 tuples there.
That can't end well. Sure, we can claim it's a bug in the FDW extension,
but it's also due to the API design.
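
IMHO the API would be less error-prone if the executor called the
function exactly once per result relation and cached the value, say
like this (a sketch; ri_BatchSize is a made-up field, and I'm assuming
the callback takes the ResultRelInfo):

--------------------------------------------------
/* sketch: determine the batch size once, before the first row */
if (resultRelInfo->ri_BatchSize == 0)   /* hypothetical field */
    resultRelInfo->ri_BatchSize =
        resultRelInfo->ri_FdwRoutine->GetMaxBulkInsertTuples(resultRelInfo);

/* ... and only ever consult the cached ri_BatchSize from here on */
--------------------------------------------------

Then the "value changes mid-execution" case simply can't happen, no
matter what the FDW does.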


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


