Thread: POC: postgres_fdw insert batching

POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
Hi,

One of the issues I'm fairly regularly reminded of by users/customers is
that inserting into tables sharded using FDWs is rather slow. We even
get it reported on pgsql-bugs from time to time [1].

Some of the slowness / overhead is expected, due to the latency between
machines in the sharded setup. Even just 1ms latency will make it way
more expensive than a single instance (with 1M single-row inserts, 1ms
of round-trip latency alone adds roughly 1000 seconds).

But let's do a simple experiment, comparing a hash-partitioned table with
regular partitions to one with FDW partitions (pointing back to the same
instance). Scripts to run this are attached. The duration of inserting 1M
rows into this table (average of 10 runs on my laptop) looks like this:

   regular: 2872 ms
   FDW:     64454 ms

Yep, it's ~20x slower, on a setup with ping latency well below 0.05ms.
Imagine how it would look on sharded setups with 0.1ms or 1ms latency,
which is probably where most single-DC clusters are :-(

Now, the primary reason why the performance degrades like this is that
while FDW has batching for SELECT queries (i.e. we read larger chunks of
data from the cursors), we don't have that for INSERTs (or other DML).
Every time you insert a row, it has to go all the way down into the
partition synchronously.

For some use cases this may be mitigated by having many independent
connections from different users, so the per-user latency is higher but
acceptable. But if you need to import larger amounts of data (say, a CSV
file for analytics, ...) this may not work.

Some time ago I wrote an ugly PoC adding batching, just to see how far
it would get us, and it seems quite promising - results for the same
INSERT benchmark look like this:

    FDW batching: 4584 ms

So, rather nice improvement, I'd say ...
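
To illustrate what the PoC actually changes, here is roughly what the
remote statements look like (a sketch with hypothetical table/column
names, showing a batch of 3 while the PoC hardcodes 100):

   -- without batching: one round trip per row
   INSERT INTO part_1 (a, b) VALUES ($1, $2);

   -- with batching: one round trip per batch, using a multi-row VALUES
   INSERT INTO part_1 (a, b) VALUES ($1, $2), ($3, $4), ($5, $6);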

Before I spend more time hacking on this, I have a couple open questions
about the design, restrictions etc.


1) Extend the FDW API?

In the patch, the batching is simply "injected" into the existing insert
API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to
extend the API with a "batched" version of the method, so that we can
easily determine whether the FDW supports batching or not - it would
require changes in the callers, though. OTOH it might be useful for
COPY, where we could do something similar to multi_insert (COPY already
benefits from this patch, but it does not use the batching built into
COPY).


2) What about the insert results?

I'm not sure what to do about the "result" status for the inserted rows.
We only really "stash" the rows into a buffer, so we don't know if the
insert will succeed or not. The patch simply assumes it will succeed, but
that's clearly wrong, and it may result in reporting a wrong number of rows.

The patch also disables the batching when the insert has a RETURNING
clause, because there's just a single slot (for the currently inserted
row). I suppose a "batching" method would take an array of slots.


3) What about the other DML operations (DELETE/UPDATE)?

The other DML operations could probably benefit from the batching too.
INSERT was good enough for a PoC, but having batching only for INSERT
seems somewhat asymmetric. DELETE/UPDATE seem more complicated because
of quals, but likely doable.


3) Should we do batching for COPY instead?

While looking at multi_insert, I've realized it's mostly exactly what
the new "batching insert" API function would need to be. But it's only
really used in COPY, so I wonder if we should just abandon the idea of
batching INSERTs and do batching COPY for FDW tables.

For cases that can replace INSERT with COPY this would be enough, but
unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
this :-(


4) Expected consistency?

I'm not entirely sure what the consistency expectations for FDWs are.
Currently the FDW nodes pointing to the same server share a connection,
so the inserted rows might be visible to other nodes. But if we only
stash the rows in a local buffer for a while, that's no longer true. So
maybe this breaks the consistency expectations?

But maybe that's OK - I'm not sure how the prepared statements/cursors
affect this. I can imagine restricting the batching only to plans where
this is not an issue (single FDW node or something), but it seems rather
fragile and undesirable.

I was thinking about adding a GUC to enable/disable the batching at some
level (global, server, table, ...) but it seems like a bad match because
the consistency expectations likely depend on a query. There should be a
GUC to set the batch size, though (it's hardcoded to 100 for now).


regards



[1] https://www.postgresql.org/message-id/CACnz%2BQ1q0%2B2KoJam9LyNMk8JmdC6qYHXWB895Wu2xcpoip18xQ%40mail.gmail.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
Hi Tomas,

On Mon, Jun 29, 2020 at 12:10 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> One of the issues I'm fairly regularly reminded by users/customers is
> that inserting into tables sharded using FDWs are rather slow. We do
> even get it reported on pgsql-bugs from time to time [1].
>
> Some of the slowness / overhead is expected, doe to the latency between
> machines in the sharded setup. Even just 1ms latency will make it way
> more expensive than a single instance.
>
> But let's do a simple experiment, comparing a hash-partitioned regular
> partitions, and one with FDW partitions in the same instance. Scripts to
> run this are attached. The duration of inserting 1M rows to this table
> (average of 10 runs on my laptop) looks like this:
>
>    regular: 2872 ms
>    FDW:     64454 ms
>
> Yep, it's ~20x slower. On setup with ping latency well below 0.05ms.
> Imagine how would it look on sharded setups with 0.1ms or 1ms latency,
> which is probably where most single-DC clusters are :-(
>
> Now, the primary reason why the performance degrades like this is that
> while FDW has batching for SELECT queries (i.e. we read larger chunks of
> data from the cursors), we don't have that for INSERTs (or other DML).
> Every time you insert a row, it has to go all the way down into the
> partition synchronously.
>
> For some use cases this may be reduced by having many independent
> connnections from different users, so the per-user latency is higher but
> acceptable. But if you need to import larger amounts of data (say, a CSV
> file for analytics, ...) this may not work.
>
> Some time ago I wrote an ugly PoC adding batching, just to see how far
> would it get us, and it seems quite promising - results for he same
> INSERT benchmarks look like this:
>
>     FDW batching: 4584 ms
>
> So, rather nice improvement, I'd say ...

Very nice indeed.

> Before I spend more time hacking on this, I have a couple open questions
> about the design, restrictions etc.

I think you may want to take a look at this recent proposal by Andrey Lepikhov:

* [POC] Fast COPY FROM command for the table with foreign partitions *
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: POC: postgres_fdw insert batching

From
Ashutosh Bapat
Date:
On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

>
>     FDW batching: 4584 ms
>
> So, rather nice improvement, I'd say ...

Very nice.

>
> Before I spend more time hacking on this, I have a couple open questions
> about the design, restrictions etc.
>
>
> 1) Extend the FDW API?
>
> In the patch, the batching is simply "injected" into the existing insert
> API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to
> extend the API with a "batched" version of the method, so that we can
> easily determine whether the FDW supports batching or not - it would
> require changes in the callers, though. OTOH it might be useful for
> COPY, where we could do something similar to multi_insert (COPY already
> benefits from this patch, but it does not use the batching built-into
> COPY).

Amit Langote has pointed out a related patch being discussed on hackers at [1].

That patch introduces a new API. But if we can do it without
introducing a new API that will be good. FDWs which can support
batching can just modify their code and don't have to implement and
manage a new API. We already have a handful of those APIs.

>
> 2) What about the insert results?
>
> I'm not sure what to do about "result" status for the inserted rows. We
> only really "stash" the rows into a buffer, so we don't know if it will
> succeed or not. The patch simply assumes it will succeed, but that's
> clearly wrong, and it may result in reporting a wrong number or rows.

I didn't get this. We are executing an INSERT on the foreign server,
so we get the number of rows INSERTed from that server. We should just
add those up across batches. If there's a failure, it would abort the
transaction, local as well as remote.

>
> The patch also disables the batching when the insert has a RETURNING
> clause, because there's just a single slot (for the currently inserted
> row). I suppose a "batching" method would take an array of slots.
>

It will be a rare case when a bulk load also has a RETURNING clause.
So, we can live with this restriction. We should try to choose a
design which allows that restriction to be lifted in the future. But I
doubt that restriction will be a serious one.

>
> 3) What about the other DML operations (DELETE/UPDATE)?
>
> The other DML operations could probably benefit from the batching too.
> INSERT was good enough for a PoC, but having batching only for INSERT
> seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> of quals, but likely doable.

Bulk INSERTs are more common in a sharded environment because of data
loads in, say, OLAP systems. Bulk updates/deletes are rarer, although
not unheard of. So an approach that just supports bulk INSERT and not
bulk UPDATE/DELETE will address a large number of use cases IMO. But if
we can make everything work together that would be good as well.

In your patch, I see that an INSERT statement with batch is
constructed as INSERT INTO ... VALUES (...), (...) as many values as
the batch size. That won't work as is for UPDATE/DELETE since we can't
pass multiple pairs of ctids and columns to be updated for each ctid
in one statement. Maybe we could build as many UPDATE/DELETE
statements as the size of a batch, but that would be ugly. What we
need is a feature like a batch prepared statement in libpq similar to
what JDBC supports
(https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/).
This will allow a single prepared statement to be executed with a
batch of parameters, each batch corresponding to one foreign DML
statement.

>
>
> 3) Should we do batching for COPY insteads?
>
> While looking at multi_insert, I've realized it's mostly exactly what
> the new "batching insert" API function would need to be. But it's only
> really used in COPY, so I wonder if we should just abandon the idea of
> batching INSERTs and do batching COPY for FDW tables.

I think this won't support RETURNING either. But if we could somehow
use the COPY protocol to send the data to the foreign server and yet
treat it as an INSERT, that might work. I think we have to find out
which performs better, COPY or batched INSERT.

>
> For cases that can replace INSERT with COPY this would be enough, but
> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
> this :-(

Agreed, if we want to support bulk UPDATE/DELETE as well.

>
>
> 4) Expected consistency?
>
> I'm not entirely sure what are the consistency expectations for FDWs.
> Currently the FDW nodes pointing to the same server share a connection,
> so the inserted rows might be visible to other nodes. But if we only
> stash the rows in a local buffer for a while, that's no longer true. So
> maybe this breaks the consistency expectations?
>
> But maybe that's OK - I'm not sure how the prepared statements/cursors
> affect this. I can imagine restricting the batching only to plans where
> this is not an issue (single FDW node or something), but it seems rather
> fragile and undesirable.

I think that area is grey. Depending upon where the cursor is
positioned when a DML node executes a query, the data fetched from the
cursor may or may not see the effect of the DML. The cursor position is
based on the batch size, so we already have problems in this area I
think. Assuming that the DML and SELECT are independent, this will
work. So the consistency problem exists; it will just be modulated by
batching the DML. I doubt it's related to this feature exclusively, and
it should be solved independently of this feature.

>
> I was thinking about adding a GUC to enable/disable the batching at some
> level (global, server, table, ...) but it seems like a bad match because
> the consistency expectations likely depend on a query. There should be a
> GUC to set the batch size, though (it's hardcoded to 100 for now).
>

Similar to fetch_size, it should be a foreign server / foreign table level setting, IMO.
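
For example, something along these lines might work (server/table names
hypothetical, and "batch_size" is just a placeholder option name here,
analogous to how fetch_size is configured):

  ALTER SERVER loopback OPTIONS (ADD batch_size '100');
  ALTER FOREIGN TABLE ft_orders OPTIONS (ADD batch_size '10');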

[1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru


-- 
Best Wishes,
Ashutosh Bapat



Re: POC: postgres_fdw insert batching

From
Etsuro Fujita
Date:
On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:

> > 3) What about the other DML operations (DELETE/UPDATE)?
> >
> > The other DML operations could probably benefit from the batching too.
> > INSERT was good enough for a PoC, but having batching only for INSERT
> > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> > of quals, but likely doable.
>
> Bulk INSERTs are more common in a sharded environment because of data
> load in say OLAP systems. Bulk update/delete are rare, although not
> that rare. So if an approach just supports bulk insert and not bulk
> UPDATE/DELETE that will address a large number of usecases IMO. But if
> we can make everything work together that would be good as well.

In most cases, I think the entire UPDATE/DELETE operations would be
pushed down to the remote side by DirectModify.  So, I'm not sure we
really need the bulk UPDATE/DELETE.

> > 3) Should we do batching for COPY insteads?
> >
> > While looking at multi_insert, I've realized it's mostly exactly what
> > the new "batching insert" API function would need to be. But it's only
> > really used in COPY, so I wonder if we should just abandon the idea of
> > batching INSERTs and do batching COPY for FDW tables.

> I think we have find out which performs
> better COPY or batch INSERT.

Maybe I'm missing something, but the COPY patch [1] seems more
promising to me, because 1) it would not get the remote side's planner
and executor involved, and 2) the data would be loaded more
efficiently by multi-insert on the remote side.  (Yeah, COPY doesn't
support RETURNING, but it's rare that RETURNING is needed in a bulk
load, as you mentioned.)

> [1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Best regards,
Etsuro Fujita



Re: POC: postgres_fdw insert batching

From
Ashutosh Bapat
Date:


On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:

> > 3) What about the other DML operations (DELETE/UPDATE)?
> >
> > The other DML operations could probably benefit from the batching too.
> > INSERT was good enough for a PoC, but having batching only for INSERT
> > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> > of quals, but likely doable.
>
> Bulk INSERTs are more common in a sharded environment because of data
> load in say OLAP systems. Bulk update/delete are rare, although not
> that rare. So if an approach just supports bulk insert and not bulk
> UPDATE/DELETE that will address a large number of usecases IMO. But if
> we can make everything work together that would be good as well.

In most cases, I think the entire UPDATE/DELETE operations would be
pushed down to the remote side by DirectModify.  So, I'm not sure we
really need the bulk UPDATE/DELETE.

That may not be true for a partitioned table whose partitions are foreign tables. Esp. given the work that Amit Langote is doing [1]. It really depends on the ability of postgres_fdw to detect that the DML modifying each of the partitions can be pushed down. That may not come easily.
 

> > 3) Should we do batching for COPY insteads?
> >
> > While looking at multi_insert, I've realized it's mostly exactly what
> > the new "batching insert" API function would need to be. But it's only
> > really used in COPY, so I wonder if we should just abandon the idea of
> > batching INSERTs and do batching COPY for FDW tables.

> I think we have find out which performs
> better COPY or batch INSERT.

Maybe I'm missing something, but I think the COPY patch [1] seems more
promising to me, because 1) it would not get the remote side's planner
and executor involved, and 2) the data would be loaded more
efficiently by multi-insert on the remote side.  (Yeah, COPY doesn't
support RETURNING, but it's rare that RETURNING is needed in a bulk
load, as you mentioned.)

> [1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Best regards,
Etsuro Fujita

--
Best Wishes,
Ashutosh

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Tue, Jun 30, 2020 at 1:22 PM Ashutosh Bapat
<ashutosh.bapat@2ndquadrant.com> wrote:
> On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
>> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
>> <ashutosh.bapat.oss@gmail.com> wrote:
>> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
>> > <tomas.vondra@2ndquadrant.com> wrote:
>>
>> > > 3) What about the other DML operations (DELETE/UPDATE)?
>> > >
>> > > The other DML operations could probably benefit from the batching too.
>> > > INSERT was good enough for a PoC, but having batching only for INSERT
>> > > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
>> > > of quals, but likely doable.
>> >
>> > Bulk INSERTs are more common in a sharded environment because of data
>> > load in say OLAP systems. Bulk update/delete are rare, although not
>> > that rare. So if an approach just supports bulk insert and not bulk
>> > UPDATE/DELETE that will address a large number of usecases IMO. But if
>> > we can make everything work together that would be good as well.
>>
>> In most cases, I think the entire UPDATE/DELETE operations would be
>> pushed down to the remote side by DirectModify.  So, I'm not sure we
>> really need the bulk UPDATE/DELETE.
> That may not be true for a partitioned table whose partitions are foreign tables. Esp. given the work that Amit
Langote is doing [1]. It really depends on the ability of postgres_fdw to detect that the DML modifying each of the
partitions can be pushed down. That may not come easily.

While it's true that how to accommodate the DirectModify API in the
new inherited update/delete planning approach is an open question on
that thread, I would eventually like to find an answer to that.  That
is, that work shouldn't result in losing the foreign partition's
ability to use DirectModify API to optimize updates/deletes.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Re: POC: postgres_fdw insert batching

From
Etsuro Fujita
Date:
On Tue, Jun 30, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jun 30, 2020 at 1:22 PM Ashutosh Bapat
> <ashutosh.bapat@2ndquadrant.com> wrote:
> > On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> >> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
> >> <ashutosh.bapat.oss@gmail.com> wrote:
> >> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
> >> > <tomas.vondra@2ndquadrant.com> wrote:
> >>
> >> > > 3) What about the other DML operations (DELETE/UPDATE)?
> >> > >
> >> > > The other DML operations could probably benefit from the batching too.
> >> > > INSERT was good enough for a PoC, but having batching only for INSERT
> >> > > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> >> > > of quals, but likely doable.
> >> >
> >> > Bulk INSERTs are more common in a sharded environment because of data
> >> > load in say OLAP systems. Bulk update/delete are rare, although not
> >> > that rare. So if an approach just supports bulk insert and not bulk
> >> > UPDATE/DELETE that will address a large number of usecases IMO. But if
> >> > we can make everything work together that would be good as well.
> >>
> >> In most cases, I think the entire UPDATE/DELETE operations would be
> >> pushed down to the remote side by DirectModify.  So, I'm not sure we
> >> really need the bulk UPDATE/DELETE.
> > That may not be true for a partitioned table whose partitions are foreign tables. Esp. given the work that Amit
Langote is doing [1]. It really depends on the ability of postgres_fdw to detect that the DML modifying each of the
partitions can be pushed down. That may not come easily.
>
> While it's true that how to accommodate the DirectModify API in the
> new inherited update/delete planning approach is an open question on
> that thread, I would eventually like to find an answer to that.  That
> is, that work shouldn't result in losing the foreign partition's
> ability to use DirectModify API to optimize updates/deletes.

That would be great!

Best regards,
Etsuro Fujita



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On Mon, Jun 29, 2020 at 04:22:15PM +0530, Ashutosh Bapat wrote:
>On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>
>>
>>     FDW batching: 4584 ms
>>
>> So, rather nice improvement, I'd say ...
>
>Very nice.
>
>>
>> Before I spend more time hacking on this, I have a couple open questions
>> about the design, restrictions etc.
>>
>>
>> 1) Extend the FDW API?
>>
>> In the patch, the batching is simply "injected" into the existing insert
>> API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to
>> extend the API with a "batched" version of the method, so that we can
>> easily determine whether the FDW supports batching or not - it would
>> require changes in the callers, though. OTOH it might be useful for
>> COPY, where we could do something similar to multi_insert (COPY already
>> benefits from this patch, but it does not use the batching built-into
>> COPY).
>
>Amit Langote has pointed out a related patch being discussed on hackers at [1].
>
>That patch introduces a new API. But if we can do it without
>introducing a new API that will be good. FDWs which can support
>batching can just modify their code and don't have to implement and
>manage a new API. We already have a handful of those APIs.
>

I don't think extending the API is a big issue - the FDW code will need
changing anyway, so this seems minor.

I'll take a look at the COPY patch - I agree it seems like a good idea,
although it can be less convenient in various cases (e.g. I've seen a lot
of INSERT ... SELECT queries in sharded systems, etc.).

>>
>> 2) What about the insert results?
>>
>> I'm not sure what to do about "result" status for the inserted rows. We
>> only really "stash" the rows into a buffer, so we don't know if it will
>> succeed or not. The patch simply assumes it will succeed, but that's
>> clearly wrong, and it may result in reporting a wrong number or rows.
>
>I didn't get this. We are executing an INSERT on the foreign server,
>so we get the number of rows INSERTed from that server. We should just
>add those up across batches. If there's a failure, it would abort the
>transaction, local as well as remote.
>

True, but it's not the FDW code doing the counting - it's the caller,
depending on whether the ExecForeignInsert returns a valid slot or NULL.
So it's not quite possible to just return a number of inserted tuples,
as returned by the remote server.

>>
>> The patch also disables the batching when the insert has a RETURNING
>> clause, because there's just a single slot (for the currently inserted
>> row). I suppose a "batching" method would take an array of slots.
>>
>
>It will be a rare case when a bulk load also has a RETURNING clause.
>So, we can leave with this restriction. We should try to choose a
>design which allows that restriction to be lifted in the future. But I
>doubt that restriction will be a serious one.
>
>>
>> 3) What about the other DML operations (DELETE/UPDATE)?
>>
>> The other DML operations could probably benefit from the batching too.
>> INSERT was good enough for a PoC, but having batching only for INSERT
>> seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
>> of quals, but likely doable.
>
>Bulk INSERTs are more common in a sharded environment because of data
>load in say OLAP systems. Bulk update/delete are rare, although not
>that rare. So if an approach just supports bulk insert and not bulk
>UPDATE/DELETE that will address a large number of usecases IMO. But if
>we can make everything work together that would be good as well.
>
>In your patch, I see that an INSERT statement with batch is
>constructed as INSERT INTO ... VALUES (...), (...) as many values as
>the batch size. That won't work as is for UPDATE/DELETE since we can't
>pass multiple pairs of ctids and columns to be updated for each ctid
>in one statement. Maybe we could build as many UPDATE/DELETE
>statements as the size of a batch, but that would be ugly. What we
>need is a feature like a batch prepared statement in libpq similar to
>what JDBC supports
>((https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/).
>This will allow a single prepared statement to be executed with a
>batch of parameters, each batch corresponding to one foreign DML
>statement.
>

I'm pretty sure we could make it work with some array/unnest tricks to
build a relation, and use that as a source of data.
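
For instance (just a sketch with made-up names; postgres_fdw identifies
remote rows by ctid, so the batch would bind an array of ctids plus an
array per updated column):

   UPDATE remote_tab t
      SET val = batch.new_val
     FROM unnest($1::tid[], $2::int[]) AS batch(tid_, new_val)
    WHERE t.ctid = batch.tid_;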

>>
>>
>> 3) Should we do batching for COPY insteads?
>>
>> While looking at multi_insert, I've realized it's mostly exactly what
>> the new "batching insert" API function would need to be. But it's only
>> really used in COPY, so I wonder if we should just abandon the idea of
>> batching INSERTs and do batching COPY for FDW tables.
>
>I think this won't support RETURNING as well. But if we could somehow
>use copy protocol to send the data to the foreign server and yet treat
>it as INSERT, that might work. I think we have find out which performs
>better COPY or batch INSERT.
>

I don't see why not to support both; the use cases are somewhat
different, I think.

>>
>> For cases that can replace INSERT with COPY this would be enough, but
>> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
>> this :-(
>
>Agreed, if we want to support bulk UPDATE/DELETE as well.
>
>>
>>
>> 4) Expected consistency?
>>
>> I'm not entirely sure what are the consistency expectations for FDWs.
>> Currently the FDW nodes pointing to the same server share a connection,
>> so the inserted rows might be visible to other nodes. But if we only
>> stash the rows in a local buffer for a while, that's no longer true. So
>> maybe this breaks the consistency expectations?
>>
>> But maybe that's OK - I'm not sure how the prepared statements/cursors
>> affect this. I can imagine restricting the batching only to plans where
>> this is not an issue (single FDW node or something), but it seems rather
>> fragile and undesirable.
>
>I think that area is grey. Depending upon where the cursor is
>positioned when a DML node executes a query, the data fetched from
>cursor may or may not see the effect of DML. The cursor position is
>based on the batch size so we already have problems in this area I
>think. Assuming that the DML and SELECT are independent this will
>work. So, the consistency problems exists, it will just be modulated
>by batching DML. I doubt that's related to this feature exclusively
>and should be solved independent of this feature.
>

OK, thanks for the feedback.

>>
>> I was thinking about adding a GUC to enable/disable the batching at some
>> level (global, server, table, ...) but it seems like a bad match because
>> the consistency expectations likely depend on a query. There should be a
>> GUC to set the batch size, though (it's hardcoded to 100 for now).
>>
>
>Similar to fetch_size, it should foreign server, table level setting, IMO.
>
>[1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru
>

Yeah, I agree we should have a GUC to define the batch size. What I had
in mind was something that would allow us to enable/disable batching to
increase the consistency guarantees, or something like that. I think
simple GUCs are a poor solution for that.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: postgres_fdw insert batching

From
Ashutosh Bapat
Date:


On Tue, 30 Jun 2020 at 22:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>I didn't get this. We are executing an INSERT on the foreign server,
>so we get the number of rows INSERTed from that server. We should just
>add those up across batches. If there's a failure, it would abort the
>transaction, local as well as remote.
>

True, but it's not the FDW code doing the counting - it's the caller,
depending on whether the ExecForeignInsert returns a valid slot or NULL.
So it's not quite possible to just return a number of inserted tuples,
as returned by the remote server.

Hmm yes, now I remember that bit. So for every row buffered, we return a valid slot without knowing whether that row was inserted on the remote server or not. I think we have that problem even now where a single INSERT might result in multiple INSERTs on the remote server (rare but not completely impossible).
 

>In your patch, I see that an INSERT statement with batch is
>constructed as INSERT INTO ... VALUES (...), (...) as many values as
>the batch size. That won't work as is for UPDATE/DELETE since we can't
>pass multiple pairs of ctids and columns to be updated for each ctid
>in one statement. Maybe we could build as many UPDATE/DELETE
>statements as the size of a batch, but that would be ugly. What we
>need is a feature like a batch prepared statement in libpq similar to
>what JDBC supports
>((https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/).
>This will allow a single prepared statement to be executed with a
>batch of parameters, each batch corresponding to one foreign DML
>statement.
>

I'm pretty sure we could make it work with some array/unnest tricks to
build a relation, and use that as a source of data.

That sounds great. The solution will be limited to postgres_fdw only.
 
I don't see why not support both, the use cases are somewhat different I
think.

+1, if we can do both.

--
Best Wishes,
Ashutosh

Re: POC: postgres_fdw insert batching

From
Andres Freund
Date:
Hi,

On 2020-06-28 17:10:02 +0200, Tomas Vondra wrote:
> 3) Should we do batching for COPY insteads?
> 
> While looking at multi_insert, I've realized it's mostly exactly what
> the new "batching insert" API function would need to be. But it's only
> really used in COPY, so I wonder if we should just abandon the idea of
> batching INSERTs and do batching COPY for FDW tables.
> 
> For cases that can replace INSERT with COPY this would be enough, but
> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
> this :-(

I personally think - and I realize that that might be annoying to
somebody looking to make an incremental improvement - that the
nodeModifyTable.c and copy.c code dealing with DML has become too
complicated to add features like this without a larger
refactoring. Leading to choices like this, whether to add a feature in
one place but not the other.

I think before we add more complexity, we ought to centralize and clean
up the DML handling code so most is shared between copy.c and
nodeModifyTable.c. Then we can much more easily add batching to FDWs, to
CTAS, to INSERT INTO SELECT etc, for which there are patches already.


> 4) Expected consistency?
> 
> I'm not entirely sure what are the consistency expectations for FDWs.
> Currently the FDW nodes pointing to the same server share a connection,
> so the inserted rows might be visible to other nodes. But if we only
> stash the rows in a local buffer for a while, that's no longer true. So
> maybe this breaks the consistency expectations?

Given that for local queries that's not the case (since the snapshot
won't have those changes visible), I think we shouldn't be too concerned
about that. If anything we should be concerned about the opposite.

If we are concerned, perhaps we could add functionality to flush all
pending changes before executing further statements?



> I was thinking about adding a GUC to enable/disable the batching at some
> level (global, server, table, ...) but it seems like a bad match because
> the consistency expectations likely depend on a query. There should be a
> GUC to set the batch size, though (it's hardcoded to 100 for now).

Hm. If libpq allowed us to utilize pipelining, ISTM the answer here would
be to not batch by building a single statement with all rows as a VALUES,
but to issue the single INSERTs in a pipelined manner. That'd probably
remove all behavioural differences.   I really wish somebody would pick
up that libpq patch again.

Greetings,

Andres Freund



Re: POC: postgres_fdw insert batching

From
"Andrey V. Lepikhov"
Date:
On 6/28/20 8:10 PM, Tomas Vondra wrote:
> Now, the primary reason why the performance degrades like this is that
> while FDW has batching for SELECT queries (i.e. we read larger chunks of
> data from the cursors), we don't have that for INSERTs (or other DML).
> Every time you insert a row, it has to go all the way down into the
> partition synchronously.

You added new fields into the PgFdwModifyState struct. Why didn't you
reuse the ResultRelInfo::ri_CopyMultiInsertBuffer field and the
CopyMultiInsertBuffer machinery as storage for incoming tuples?

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On Fri, Jul 10, 2020 at 09:28:44AM +0500, Andrey V. Lepikhov wrote:
>On 6/28/20 8:10 PM, Tomas Vondra wrote:
>>Now, the primary reason why the performance degrades like this is that
>>while FDW has batching for SELECT queries (i.e. we read larger chunks of
>>data from the cursors), we don't have that for INSERTs (or other DML).
>>Every time you insert a row, it has to go all the way down into the
>>partition synchronously.
>
>You added new fields into the PgFdwModifyState struct. Why you didn't 
>reused ResultRelInfo::ri_CopyMultiInsertBuffer field and 
>CopyMultiInsertBuffer machinery as storage for incoming tuples?
>

Because I was focused on speeding up inserts, and that is not using
CopyMultiInsertBuffer, I think. I agree the way the tuples are stored
may be improved, of course.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello Tomas san,


Thank you for picking this up.  I'm interested in this topic, too.  (As an aside, we'd like to submit a bulk insert
patch for ECPG in the near future.)

As others mentioned, Andrey-san's fast COPY to foreign partitions is also promising.  But I think your bulk INSERT is a
separate feature and offers what COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE
TABLE AS SELECT.

Is there anything that makes you worry and stops development?  Could I give it a try to implement this?  (I'm not sure
I can, sorry.  I'm worried whether we can change the executor's call chain easily.)


> 1) Extend the FDW API?

Yes, I think so, because FDWs for other DBMSs will benefit from this.  (But it's questionable whether we want users to
transfer data in a Postgres database to other DBMSs...)

MySQL and SQL Server have the same bulk insert syntax as Postgres, i.e., INSERT INTO table VALUES (record1), (record2),
... Oracle doesn't have this syntax, but it can use a CTE as follows:

  INSERT INTO table
  WITH t AS (
    SELECT record1 FROM DUAL UNION ALL
    SELECT record2 FROM DUAL UNION ALL
    ...
  )
  SELECT * FROM t;

And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION ALL SELECT record2 ...
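
For example, the UNION ALL form would look like this (hypothetical table
and values):

  INSERT INTO tab (a, b)
  SELECT 1, 'one' UNION ALL
  SELECT 2, 'two' UNION ALL
  SELECT 3, 'three';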

The API would simply be:

TupleTableSlot **
ExecForeignMultiInsert(EState *estate,
                  ResultRelInfo *rinfo,
                  TupleTableSlot **slot,
                  TupleTableSlot **planSlot,
                  int numSlots);


> 2) What about the insert results?

I'm wondering if we can report success or failure of each inserted row, because the remote INSERT will fail or succeed
entirely.  Other FDWs may be able to do it, so the API can be like the above.

For the same reason, support for RETURNING clause will vary from DBMS to DBMS.


> 3) What about the other DML operations (DELETE/UPDATE)?

I don't think they are necessary for the time being.  If we want them, they will be implemented using the libpq
batch/pipelining as Andres-san said.


> 3) Should we do batching for COPY insteads?

I'm thinking of issuing INSERT with multiple records as your patch does, because:

* When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY.

* COPY doesn't invoke rules, unlike INSERT.  (I don't think rules are a feature users care about, though.)  Also,
I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY.


[1]
Fast COPY FROM command for the table with foreign partitions
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru


Regards
Takayuki Tsunakawa




Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On Thu, Oct 08, 2020 at 02:40:10AM +0000, tsunakawa.takay@fujitsu.com wrote:
>Hello Tomas san,
>
>
>Thank you for picking up this.  I'm interested in this topic, too.  (As an aside, we'd like to submit a bulk insert
patch for ECPG in the near future.)
 
>
>As others referred, Andrey-san's fast COPY to foreign partitions is also promising.  But I think your bulk INSERT is a
separate feature and offers what COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS
SELECT.
>
>Is there anything that makes you worry and stops development?  Could I give it a try to implement this (I'm not sure I
can, sorry.  I'm worried if we can change the executor's call chain easily.)
 
>

It's primarily a matter of having too much other stuff on my plate, thus
not having time to work on this feature. I was not too worried about any
particular issue, but I wanted some feedback before spending more time
on extending the API.

I'm not sure when I'll have time to work on this again, so if you are
interested and willing to work on it, please go ahead. I'll gladly do
reviews and help you with it.

>
>> 1) Extend the FDW API?
>
>Yes, I think, because FDWs for other DBMSs will benefit from this.  (But it's questionable whether we want users to
transfer data in Postgres database to other DBMSs...)
 
>

I think transferring data to other databases is fine - interoperability
is a big advantage for users, I don't see it as something threatening
the PostgreSQL project. I doubt this would make it more likely for users
to migrate from PostgreSQL - there are many ways to do that already.


>MySQL and SQL Server has the same bulk insert syntax as Postgres, i.e., INSERT INTO table VALUES(record1), (record2),
... Oracle doesn't have this syntax, but it can use CTE as follows:
 
>
>  INSERT INTO table
>  WITH t AS (
>    SELECT record1 FROM DUAL UNION ALL
>    SELECT record2 FROM DUAL UNION ALL
>    ...
>  )
>  SELECT * FROM t;
>
>And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION ALL SELECT record2 ...
>

True. In some cases INSERT may be replaced by COPY, but it has various
other features too.

>The API would simply be:
>
>TupleTableSlot **
>ExecForeignMultiInsert(EState *estate,
>                  ResultRelInfo *rinfo,
>                  TupleTableSlot **slot,
>                  TupleTableSlot **planSlot,
>                  int numSlots);
>
>

+1, seems quite reasonable

>> 2) What about the insert results?
>
>I'm wondering if we can report success or failure of each inserted row, because the remote INSERT will fail entirely.
Other FDWs may be able to do it, so the API can be like above.
 
>

Yeah. I think handling complete failure should not be very difficult,
but there are cases that worry me more. For example, what if there's a
before trigger (on the remote db) that "skips" inserting some of the
rows by returning NULL?

>For the same reason, support for RETURNING clause will vary from DBMS to DBMS.
>

Yeah. I wonder if the FDW needs to indicate which features are supported
by the ExecForeignMultiInsert, e.g. by adding a function that decides
whether batch insert is supported (it might also do that internally by
calling ExecForeignInsert, of course).

>
>> 3) What about the other DML operations (DELETE/UPDATE)?
>
>I don't think they are necessary for the time being.  If we want them, they will be implemented using the libpq
batch/pipelining as Andres-san said.
 
>

I agree.

>
>> 3) Should we do batching for COPY insteads?
>
>I'm thinking of issuing INSERT with multiple records as your patch does, because:
>
>* When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY.
>
>* COPY doesn't invoke rules unlike INSERT.  (I don't think the rule is a feature what users care about, though.)
Also, I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY.
 
>

I agree.



regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
> I'm not sure when I'll have time to work on this again, so if you are
> interested and willing to work on it, please go ahead. I'll gladly do
> reviews and help you with it.

Thank you very much.


> I think transferring data to other databases is fine - interoperability
> is a big advantage for users, I don't see it as something threatening
> the PostgreSQL project. I doubt this would make it more likely for users
> to migrate from PostgreSQL - there are many ways to do that already.

Definitely true.  Users may want to use INSERT SELECT to do some data transformation in their OLTP database and load it
into a non-Postgres data warehouse.


> Yeah. I think handling complete failure should not be very difficult,
> but there are cases that worry me more. For example, what if there's a
> before trigger (on the remote db) that "skips" inserting some of the
> rows by returning NULL?

> Yeah. I wonder if the FDW needs to indicate which features are supported
> by the ExecForeignMultiInsert, e.g. by adding a function that decides
> whether batch insert is supported (it might also do that internally by
> calling ExecForeignInsert, of course).

Thanks for your advice.  I'll try to address them.


 Regards
Takayuki Tsunakawa




RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello,


The attached patch implements the new bulk insert routine for postgres_fdw and the executor utilizing it.  It passes
make check-world.

I measured performance in a basic non-partitioned case by modifying Tomas-san's scripts.  They perform an INSERT SELECT
statement that copies one million records.  The table consists of two integer columns, with a primary key on one of
them.  You can run the attached prepare.sql to set up once.  local.sql inserts into the table directly, while
fdw.sql inserts through a foreign table.

The performance results, the average time of 5 runs, were as follows on a Linux host where the average round-trip time
of "ping localhost" was 34 us:

    master, local: 6.1 seconds
    master, fdw: 125.3 seconds
    patched, fdw: 11.1 seconds (11x improvement)


The patch accumulates at most 100 records in ModifyTableState before inserting in bulk.  Also, when an input record is
targeted for a different relation (= partition) than the already accumulated records, it inserts the accumulated
records and stores the new record for a later insert.

[Issues]

1. Do we want a GUC parameter, say, max_bulk_insert_records = (integer), to control the number of records inserted at
once?
The range of allowed values would be between 1 and 1,000.  1 disables bulk insert.
The possible reason for needing this kind of parameter would be to limit the amount of memory used for accumulated
records, which could be prohibitively large if each record is big.  I don't think this is a must, but I think we can
have it.

2. Should we accumulate records per relation in ResultRelInfo instead?
That is, when inserting into a partitioned table that has foreign partitions, delay insertion until a certain number of
input records accumulate, and then insert accumulated records per relation (e.g., 50 records to relation A, 30 records
to relation B, and 20 records to relation C.)  If we do that,

* The order of insertion differs from the order of input records.  Is it OK?

* Should the maximum count of accumulated records be applied per relation or the query?
When many foreign partitions belong to a partitioned table, if the former is chosen, it may use much memory in total.
If the latter is chosen, the records per relation could be few and thus the benefit of bulk insert could be small.


Regards
Takayuki Tsunakawa


Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
Hi,

Thanks for working on this!

On 11/10/20 1:45 AM, tsunakawa.takay@fujitsu.com wrote:
> Hello,
> 
> 
> The attached patch implements the new bulk insert routine for
> postgres_fdw and the executor utilizing it.  It passes make
> check-world.
> 

I haven't done any testing yet, just a quick review.

I see the patch builds the "bulk" query in execute_foreign_modify. IMO
that's something we should do earlier, when we're building the simple
query (for 1-row inserts). I'd understand if you were concerned about
overhead in case of 1-row inserts, trying to not plan the bulk query
until necessary, but I'm not sure this actually helps.

Or was the goal to build a query for every possible number of slots? I
don't think that's really useful, considering it requires deallocating
the old plan, preparing a new one, etc. IMO it should be sufficient to
have two queries - one for 1-row inserts, one for the full batch. The
last incomplete batch can be inserted using a loop of 1-row queries.

That's what my patch was doing, but I'm not insisting on that - it just
seems like a better approach to me. So feel free to argue why this is
better.


> I measured performance in a basic non-partitioned case by modifying
> Tomas-san's scripts.  They perform an INSERT SELECT statement that
> copies one million records.  The table consists of two integer
> columns, with a primary key on one of those them.  You can run the
> attached prepare.sql to set up once.  local.sql inserts to the table
> directly, while fdw.sql inserts through a foreign table.
> 
> The performance results, the average time of 5 runs,  were as follows
> on a Linux host where the average round-trip time of "ping localhost"
> was 34 us:
> 
> master, local: 6.1 seconds master, fdw: 125.3 seconds patched, fdw:
> 11.1 seconds (11x improvement)
> 

Nice. I think we can't really get much closer to local master, so 6.1
vs. 11.1 seconds looks quite acceptable.

> 
> The patch accumulates at most 100 records in ModifyTableState before
> inserting in bulk.  Also, when an input record is targeted for a
> different relation (= partition) than that for already accumulated
> records, insert the accumulated records and store the new record for
> later insert.
> 
> [Issues]
> 
> 1. Do we want a GUC parameter, say, max_bulk_insert_records =
> (integer), to control the number of records inserted at once? The
> range of allowed values would be between 1 and 1,000.  1 disables
> bulk insert. The possible reason of the need for this kind of
> parameter would be to limit the amount of memory used for accumulated
> records, which could be prohibitively large if each record is big.  I
> don't think this is a must, but I think we can have it.
> 

I think it'd be good to have such GUC, even if only for testing and
development. We should probably have a way to disable the batching,
which the GUC could also do, I think. So +1 to have the GUC.

> 2. Should we accumulate records per relation in ResultRelInfo
> instead? That is, when inserting into a partitioned table that has
> foreign partitions, delay insertion until a certain number of input
> records accumulate, and then insert accumulated records per relation
> (e.g., 50 records to relation A, 30 records to relation B, and 20
> records to relation C.)  If we do that,
> 

I think there's a chunk of text missing here? If we do that, then what?

Anyway, I don't see why accumulating the records in ResultRelInfo would
be better than what the patch does now. It seems to me fairly specific
to FDWs, so keeping it in the FDW state seems appropriate. What would
be the advantage of stashing it in ResultRelInfo?

> * The order of insertion differs from the order of input records.  Is
> it OK?
> 

I think that's OK for most use cases, and if it's not (e.g. when there's
something requiring the exact order of writes) then it's not possible to
use batching. That's one of the reasons why I think we should have a GUC
to disable the batching.

> * Should the maximum count of accumulated records be applied per
> relation or the query? When many foreign partitions belong to a
> partitioned table, if the former is chosen, it may use much memory in
> total.  If the latter is chosen, the records per relation could be
> few and thus the benefit of bulk insert could be small.
> 

I think it needs to be applied per relation, because that's the level at
which we can do it easily and consistently. The whole point is to send
data in sufficiently large chunks to minimize the communication overhead
(latency etc.), but if you enforce it "per query" that seems hard.

Imagine you're inserting data into a table with many partitions - how do
you pick the number of rows to accumulate? The table may have 10 or 1000
partitions, we may be inserting into all partitions or just a small
subset, not all partitions may be foreign, etc. It seems pretty
difficult to pick and enforce a reliable limit at the query level. But
maybe I'm missing something and it's easier than I think?

Of course, you're entirely correct enforcing this at the partition level
may require a lot of memory. Sadly, I don't see a way around that,
except for (a) disabling batching or (b) ordering the data to insert
data into one partition at a time.
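
For (b), with range/list partitioning something as simple as sorting the
source data by the partition key should do it, so that each partition's
rows arrive in one contiguous run (hypothetical names):

   INSERT INTO orders_parted
   SELECT * FROM staging_orders ORDER BY order_date;  -- partition key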


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 11/10/20 4:05 PM, Tomas Vondra wrote:
> Hi,
> 
> Thanks for working on this!
> 
> On 11/10/20 1:45 AM, tsunakawa.takay@fujitsu.com wrote:
>> Hello,
>>
>>
>> The attached patch implements the new bulk insert routine for
>> postgres_fdw and the executor utilizing it.  It passes make
>> check-world.
>>
> 
> I haven't done any testing yet, just a quick review.
> 
> I see the patch builds the "bulk" query in execute_foreign_modify. IMO
> that's something we should do earlier, when we're building the simple
> query (for 1-row inserts). I'd understand if you were concerned about
> overhead in case of 1-row inserts, trying to not plan the bulk query
> until necessary, but I'm not sure this actually helps.
> 
> Or was the goal to build a query for every possible number of slots? I
> don't think that's really useful, considering it requires deallocating
> the old plan, preparing a new one, etc. IMO it should be sufficient to
> have two queries - one for 1-row inserts, one for the full batch. The
> last incomplete batch can be inserted using a loop of 1-row queries.
> 
> That's what my patch was doing, but I'm not insisting on that - it just
> seems like a better approach to me. So feel free to argue why this is
> better.
> 
> 
>> I measured performance in a basic non-partitioned case by modifying
>> Tomas-san's scripts.  They perform an INSERT SELECT statement that
>> copies one million records.  The table consists of two integer
>> columns, with a primary key on one of those them.  You can run the
>> attached prepare.sql to set up once.  local.sql inserts to the table
>> directly, while fdw.sql inserts through a foreign table.
>>
>> The performance results, the average time of 5 runs,  were as follows
>> on a Linux host where the average round-trip time of "ping localhost"
>> was 34 us:
>>
>> master, local: 6.1 seconds master, fdw: 125.3 seconds patched, fdw:
>> 11.1 seconds (11x improvement)
>>
> 
> Nice. I think we can't really get much closer to local master, so 6.1
> vs. 11.1 seconds look quite acceptable.
> 
>>
>> The patch accumulates at most 100 records in ModifyTableState before
>> inserting in bulk.  Also, when an input record is targeted for a
>> different relation (= partition) than that for already accumulated
>> records, insert the accumulated records and store the new record for
>> later insert.
>>
>> [Issues]
>>
>> 1. Do we want a GUC parameter, say, max_bulk_insert_records =
>> (integer), to control the number of records inserted at once? The
>> range of allowed values would be between 1 and 1,000.  1 disables
>> bulk insert. The possible reason of the need for this kind of
>> parameter would be to limit the amount of memory used for accumulated
>> records, which could be prohibitively large if each record is big.  I
>> don't think this is a must, but I think we can have it.
>>
> 
> I think it'd be good to have such GUC, even if only for testing and
> development. We should probably have a way to disable the batching,
> which the GUC could also do, I think. So +1 to have the GUC.
> 
>> 2. Should we accumulate records per relation in ResultRelInfo
>> instead? That is, when inserting into a partitioned table that has
>> foreign partitions, delay insertion until a certain number of input
>> records accumulate, and then insert accumulated records per relation
>> (e.g., 50 records to relation A, 30 records to relation B, and 20
>> records to relation C.)  If we do that,
>>
> 
> I think there's a chunk of text missing here? If we do that, then what?
> 
> Anyway, I don't see why accumulating the records in ResultRelInfo would
> be better than what the patch does now. It seems to me like fairly
> specific to FDWs, so keeping it int FDW state seems appropriate. What
> would be the advantage of stashing it in ResultRelInfo?
> 
>> * The order of insertion differs from the order of input records.  Is
>> it OK?
>>
> 
> I think that's OK for most use cases, and if it's not (e.g. when there's
> something requiring the exact order of writes) then it's not possible to
> use batching. That's one of the reasons why I think we should have a GUC
> to disable the batching.
> 
>> * Should the maximum count of accumulated records be applied per
>> relation or the query? When many foreign partitions belong to a
>> partitioned table, if the former is chosen, it may use much memory in
>> total.  If the latter is chosen, the records per relation could be
>> few and thus the benefit of bulk insert could be small.
>>
> 
> I think it needs to be applied per relation, because that's the level at
> which we can do it easily and consistently. The whole point is to send
> data in sufficiently large chunks to minimize the communication overhead
> (latency etc.), but if you enforce it "per query" that seems hard.
> 
> Imagine you're inserting data into a table with many partitions - how do
> you pick the number of rows to accumulate? The table may have 10 or 1000
> partitions, we may be inserting into all partitions or just a small
> subset, not all partitions may be foreign, etc. It seems pretty
> difficult to pick and enforce a reliable limit at the query level. But
> maybe I'm missing something and it's easier than I think?
> 
> Of course, you're entirely correct enforcing this at the partition level
> may require a lot of memory. Sadly, I don't see a way around that,
> except for (a) disabling batching or (b) ordering the data to insert
> data into one partition at a time.
> 

Two more comments regarding this:

1) If we want to be more strict about the memory consumption, we should
probably set the limit in terms of memory, not number of rows. Currently
the 100 rows may be 10kB or 10MB, there's no way to know. Of course,
this is not the only place with this issue.

2) I wonder what the COPY FROM patch [1] does in this regard. I don't
have time to check right now, but I suggest we try to do the same thing,
if only to be consistent.

[1]
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> I see the patch builds the "bulk" query in execute_foreign_modify. IMO
> that's something we should do earlier, when we're building the simple
> query (for 1-row inserts). I'd understand if you were concerned about
> overhead in case of 1-row inserts, trying to not plan the bulk query
> until necessary, but I'm not sure this actually helps.
> 
> Or was the goal to build a query for every possible number of slots? I
> don't think that's really useful, considering it requires deallocating
> the old plan, preparing a new one, etc. IMO it should be sufficient to
> have two queries - one for 1-row inserts, one for the full batch. The
> last incomplete batch can be inserted using a loop of 1-row queries.
> 
> That's what my patch was doing, but I'm not insisting on that - it just
> seems like a better approach to me. So feel free to argue why this is
> better.

Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in
PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to
ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls.
 

The re-creation of INSERT query string and its corresponding PREPARE happen when the number of tuples to be inserted is
different from the previous call to ExecForeignInsert()/ExecForeignBulkInsert().  That's because we don't know how many
tuples will be inserted during planning (PlanForeignModify) or execution (until the scan ends for SELECT).  For example,
if we insert 10,030 rows with the bulk size 100, the flow is:
 

  PlanForeignModify():
    build the INSERT query string for 1 row
  ExecForeignBulkInsert(100):
    drop the INSERT query string and prepared statement for 1 row
    build the query string and prepare statement for 100 row INSERT
    execute it
  ExecForeignBulkInsert(100):
    reuse the prepared statement for 100 row INSERT and execute it
...
  ExecForeignBulkInsert(30):
    drop the INSERT query string and prepared statement for 100 row
    build the query string and prepare statement for 30 row INSERT
    execute it
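
To make that flow concrete, here is a minimal sketch (not the patch's actual code; PgFdwBulkState, prepare_batch_statement() and build_insert_sql() are made-up names, and error handling is omitted):

#include <stdlib.h>
#include "libpq-fe.h"

/*
 * Hypothetical helper that produces "INSERT ... VALUES ($1, ...), ..." for
 * the requested number of rows.
 */
extern char *build_insert_sql(const char *target, int ncolumns, int ntuples);

#define BULK_STMT_NAME "pgsql_fdw_bulk_insert"

typedef struct PgFdwBulkState
{
	PGconn *conn;				/* connection to the foreign server */
	int		prepared_ntuples;	/* batch size the statement was prepared for */
} PgFdwBulkState;

static void
prepare_batch_statement(PgFdwBulkState *state, const char *target,
						int ncolumns, int ntuples)
{
	char   *sql;

	/* Reuse the remote statement if this batch has the same size as the last. */
	if (state->prepared_ntuples == ntuples)
		return;

	/* Drop the statement prepared for the previous batch size, if any. */
	if (state->prepared_ntuples > 0)
		PQclear(PQexec(state->conn, "DEALLOCATE " BULK_STMT_NAME));

	/* Build and prepare the N-row INSERT for the new batch size. */
	sql = build_insert_sql(target, ncolumns, ntuples);
	PQclear(PQprepare(state->conn, BULK_STMT_NAME, sql,
					  ncolumns * ntuples, NULL));
	free(sql);

	state->prepared_ntuples = ntuples;
}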


> I think it'd be good to have such GUC, even if only for testing and
> development. We should probably have a way to disable the batching,
> which the GUC could also do, I think. So +1 to have the GUC.

OK, I'll add it.  The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations
in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not
used in GUC (except for row_security).
 

The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think that's an overreaction and am a bit
worried about unforeseen trouble too many tuples might cause.)  1 disables the bulk processing and uses the traditional
ExecForeignInsert(). The default value is 100 (would 1 be sensible as a default value to avoid surprising users by
increased memory usage?)
 


> > 2. Should we accumulate records per relation in ResultRelInfo
> > instead? That is, when inserting into a partitioned table that has
> > foreign partitions, delay insertion until a certain number of input
> > records accumulate, and then insert accumulated records per relation
> > (e.g., 50 records to relation A, 30 records to relation B, and 20
> > records to relation C.)  If we do that,
> >
> 
> I think there's a chunk of text missing here? If we do that, then what?

Sorry, the two bullets below are what follows.  Perhaps I should have written ":" instead of ",".


> Anyway, I don't see why accumulating the records in ResultRelInfo would
> be better than what the patch does now. It seems to me like fairly
> specific to FDWs, so keeping it int FDW state seems appropriate. What
> would be the advantage of stashing it in ResultRelInfo?

I thought of distributing input records to their corresponding partitions' ResultRelInfos.  For example, an input record
for partition 1 comes, store it in the ResultRelInfo for partition 1; then an input record for partition 2 comes, store it
in the ResultRelInfo for partition 2.  When a ResultRelInfo accumulates some number of rows, insert the accumulated rows
therein into the partition.  When the input ends, perform bulk inserts for ResultRelInfos that have accumulated rows.
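
A rough sketch of that idea, with made-up type and function names (PartitionInsertBuffer, flush_partition_buffer()), might look like this; at the end of input, every partition with a partially filled buffer would get one final flush:

#include "postgres.h"
#include "nodes/execnodes.h"

#define BULK_INSERT_MAX_TUPLES 100

typedef struct PartitionInsertBuffer
{
	TupleTableSlot *slots[BULK_INSERT_MAX_TUPLES];	/* pending rows */
	int				nused;							/* how many slots are filled */
} PartitionInsertBuffer;

/* Hypothetical: send the buffered rows to the partition in one bulk INSERT. */
extern void flush_partition_buffer(ResultRelInfo *rri, PartitionInsertBuffer *buf);

/* Stash one row for its partition; flush when the buffer becomes full. */
static void
buffer_row_for_partition(ResultRelInfo *rri, PartitionInsertBuffer *buf,
						 TupleTableSlot *slot)
{
	buf->slots[buf->nused++] = slot;	/* assumes the slot is materialized */

	if (buf->nused == BULK_INSERT_MAX_TUPLES)
	{
		flush_partition_buffer(rri, buf);
		buf->nused = 0;
	}
}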
 



> I think that's OK for most use cases, and if it's not (e.g. when there's
> something requiring the exact order of writes) then it's not possible to
> use batching. That's one of the reasons why I think we should have a GUC
> to disable the batching.

Agreed.


> > * Should the maximum count of accumulated records be applied per
> > relation or the query? When many foreign partitions belong to a
> > partitioned table, if the former is chosen, it may use much memory in
> > total.  If the latter is chosen, the records per relation could be
> > few and thus the benefit of bulk insert could be small.
> >
> 
> I think it needs to be applied per relation, because that's the level at
> which we can do it easily and consistently. The whole point is to send
> data in sufficiently large chunks to minimize the communication overhead
> (latency etc.), but if you enforce it "per query" that seems hard.
> 
> Imagine you're inserting data into a table with many partitions - how do
> you pick the number of rows to accumulate? The table may have 10 or 1000
> partitions, we may be inserting into all partitions or just a small
> subset, not all partitions may be foreign, etc. It seems pretty
> difficult to pick and enforce a reliable limit at the query level. But
> maybe I'm missing something and it's easier than I think?
> 
> Of course, you're entirely correct enforcing this at the partition level
> may require a lot of memory. Sadly, I don't see a way around that,
> except for (a) disabling batching or (b) ordering the data to insert
> data into one partition at a time.

OK, I think I'll try doing it like that, after waiting a few days for other opinions.


> Two more comments regarding this:
> 
> 1) If we want to be more strict about the memory consumption, we should
> probably set the limit in terms of memory, not number of rows. Currently
> the 100 rows may be 10kB or 10MB, there's no way to know. Of course,
> this is not the only place with this issue.
> 
> 2) I wonder what the COPY FROM patch [1] does in this regard. I don't
> have time to check right now, but I suggest we try to do the same thing,
> if only to be consistent.
> 
> [1]
> https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-909
> 86e55489f%40postgrespro.ru

That COPY FROM patch uses the tuple accumulation mechanism for local tables as-is.  That is, it accumulates at most
1,000 tuples per partition.
 

/*
 * No more than this many tuples per CopyMultiInsertBuffer
 *
 * Caution: Don't make this too big, as we could end up with this many
 * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
 * multiInsertBuffers list.  Increasing this can cause quadratic growth in
 * memory requirements during copies into partitioned tables with a large
 * number of partitions.
 */
#define MAX_BUFFERED_TUPLES     1000


Regards
Takayuki Tsunakawa


RE: POC: postgres_fdw insert batching

From
Tim.Colles@ed.ac.uk
Date:

Does this patch affect trigger semantics on the base table?

At the moment when I insert 1000 rows into a postgres_fdw table using a
single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I
naively expect a "statement level" trigger on the base table to trigger
once. But this is not the case. The postgres_fdw implements this
operation as 1000 separate insert statements on the base table, so the
trigger happens 1000 times instead of once. Hence there is no
distinction between using a statement level and a row level trigger on
the base table in this context.

So would this patch change the behaviour so only 10 separate insert
statements (each of 100 rows) would be made against the base table?
If so that's useful, as it means improving performance using statement
level triggers becomes possible. But it would also result in more
obscure semantics and might break user processes dependent on the
existing behaviour after the patch is applied.

BTW is this subtlety documented? I haven't found anything, but I'm happy
to be proved wrong.

Tim

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.




RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: timc@corona.is.ed.ac.uk <timc@corona.is.ed.ac.uk> On Behalf Of
> Does this patch affect trigger semantics on the base table?
>
> At the moment when I insert 1000 rows into a postgres_fdw table using a
> single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I
> naively expect a "statement level" trigger on the base table to trigger
> once. But this is not the case. The postgres_fdw implements this
> operation as 1000 separate insert statements on the base table, so the
> trigger happens 1000 times instead of once. Hence there is no
> distinction between using a statement level and a row level trigger on
> the base table in this context.
>
> So would this patch change the behaviour so only 10 separate insert
> statements (each of 100 rows) would be made against the base table?
> If so thats useful as it means improving performance using statement
> level triggers becomes possible. But it would also result in more
> obscure semantics and might break user processes dependent on the
> existing behaviour after the patch is applied.

Yes, the number of times the statement trigger defined on the base (remote) table fires will be reduced, as you said.


> BTW is this subtlety documented, I haven't found anything but happy
> to be proved wrong?

Unfortunately, there doesn't seem to be any description on triggers on base tables.  For example, if the local foreign
table has an AFTER ROW trigger and its remote base table has a BEFORE ROW trigger that modifies the input record, it
seems that the AFTER ROW trigger doesn't see the modified record. 


Regards
Takayuki Tsunakawa




RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello,


Modified the patch as I talked with Tomas-san.  The performance results of loading one million records into a
hash-partitioned table with 8 partitions are as follows:
 

    unpatched, local:   8.6 seconds
    unpatched, fdw:   113.7 seconds
    patched, fdw:      12.5 seconds  (9x improvement)

The test scripts are also attached.  Run prepare.sql once to set up tables and source data.  Run local_part.sql and
fdw_part.sql to load source data into a partitioned table with local partitions and a partitioned table with foreign
tables respectively.
 


Regards
Takayuki Tsunakawa


Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 11/17/20 10:11 AM, tsunakawa.takay@fujitsu.com wrote:
> Hello,
> 
> 
> Modified the patch as I talked with Tomas-san.  The performance
> results of loading one million records into a hash-partitioned table
> with 8 partitions are as follows:
> 
> unpatched, local:   8.6 seconds
> unpatched, fdw:   113.7 seconds
> patched, fdw:      12.5 seconds  (9x improvement)
> 
> The test scripts are also attached.  Run prepare.sql once to set up
> tables and source data.  Run local_part.sql and fdw_part.sql to load
> source data into a partitioned table with local partitions and a
> partitioned table with foreign tables respectively.
> 

Unfortunately, this does not compile for me, because nodeModifyTable
calls ExecGetTouchedPartitions, which is not defined anywhere. Not sure
what that's about, so I simply commented it out. That probably fails
the partitioned cases, but it allowed me to do some review and testing.

As for the patch, I have a couple of comments

1) As I mentioned before, I really don't think we should be doing
deparsing in execute_foreign_modify - that's something that should
happen earlier, and should be in a deparse.c function.

2) I think the GUC should be replaced with a server/table option,
similar to fetch_size.

The attached patch tries to address both of these points.

Firstly, it adds a new deparseBulkInsertSql function, that builds a
query for the "full" batch, and then uses those two queries - when we
get a full batch we use the bulk query, otherwise we use the single-row
query in a loop. IMO this is cleaner than deparsing queries ad hoc in
the execute_foreign_modify.

Of course, this might be worse when we don't have a full batch, e.g. for
a query that inserts only 50 rows with batch_size=100. If this case is
common, one option would be lowering the batch_size accordingly. If we
really want to improve this case too, I suggest we pass more info than
just a position of the VALUES clause - that seems a bit too hackish.
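
For illustration, the dispatch between the two prepared queries could look roughly like this (a sketch with hypothetical names, not the attached patch):

#include "postgres.h"
#include "executor/tuptable.h"

typedef struct FdwBulkModifyState FdwBulkModifyState;	/* opaque, hypothetical */

extern void execute_bulk_insert(FdwBulkModifyState *fmstate,
								TupleTableSlot **slots, int nslots);
extern void execute_single_insert(FdwBulkModifyState *fmstate,
								  TupleTableSlot *slot);
extern int	fdw_batch_size(FdwBulkModifyState *fmstate);

static void
execute_batch(FdwBulkModifyState *fmstate, TupleTableSlot **slots, int nslots)
{
	if (nslots == fdw_batch_size(fmstate))
	{
		/* full batch: one round trip with the pre-built N-row INSERT */
		execute_bulk_insert(fmstate, slots, nslots);
	}
	else
	{
		/* incomplete last batch: loop over the 1-row INSERT */
		for (int i = 0; i < nslots; i++)
			execute_single_insert(fmstate, slots[i]);
	}
}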


Secondly, it adds the batch_size option to server/foreign table, and
uses that. This is not complete, though. postgresPlanForeignModify
currently passes a hard-coded value; it needs to look up
the correct value for the server/table from RelOptInfo or something. And
I suppose the ModifyTable infrastructure will need to determine the value
in order to pass the correct number of slots to the FDW API.

There are a couple of other smaller changes. E.g. it undoes changes to
finish_foreign_modify, and instead calls separate functions to prepare
the bulk statement. It also adds list_make5/list_make6 macros, so as to
not have to do strange stuff with the parameter lists.


And finally, this should probably add a bunch of regression tests.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Unfortunately, this does not compile for me, because nodeModifyTable calls
> ExecGetTouchedPartitions, which is not defined anywhere. Not sure what's
> that about, so I simply commented-out this. That probably fails the partitioned
> cases, but it allowed me to do some review and testing.

Ouch, sorry.  I'm ashamed to have forgotten including execPartition.c.


> The are a couple other smaller changes. E.g. it undoes changes to
> finish_foreign_modify, and instead calls separate functions to prepare the bulk
> statement. It also adds list_make5/list_make6 macros, so as to not have to do
> strange stuff with the parameter lists.

Thanks, I'll gladly take them!  I wonder why I didn't think of separating deallocate_query() from
finish_foreign_modify()... perhaps my brain was dying.  As for list_make5/6(), I saw your first patch avoid adding
them, so I thought you found them ugly (and I felt so, too.)  But thinking now, there's no reason to hesitate.
 


> A finally, this should probably add a bunch of regression tests.

Sure.


> 1) As I mentioned before, I really don't think we should be doing deparsing in
> execute_foreign_modify - that's something that should happen earlier, and
> should be in a deparse.c function.
...
> The attached patch tries to address both of these points.
> 
> Firstly, it adds a new deparseBulkInsertSql function, that builds a query for the
> "full" batch, and then uses those two queries - when we get a full batch we use
> the bulk query, otherwise we use the single-row query in a loop. IMO this is
> cleaner than deparsing queries ad hoc in the execute_foreign_modify.
...
> Of course, this might be worse when we don't have a full batch, e.g. for a query
> that insert only 50 rows with batch_size=100. If this case is common, one
> option would be lowering the batch_size accordingly. If we really want to
> improve this case too, I suggest we pass more info than just a position of the
> VALUES clause - that seems a bit too hackish.
...
> Secondly, it adds the batch_size option to server/foreign table, and uses that.
> This is not complete, though. postgresPlanForeignModify currently passes a
> hard-coded value at the moment, it needs to lookup the correct value for the
> server/table from RelOptInfo or something. And I suppose ModifyTable
> inftractructure will need to determine the value in order to pass the correct
> number of slots to the FDW API.

I can sort of understand your feeling, but I'd like to reconstruct the query and prepare it in execute_foreign_modify()
because:

* Some of our customers use bulk insert in ECPG (INSERT ... VALUES (record1), (record2), ...) to insert a variable number
of records per query.  (Oracle's Pro*C has such a feature.)  So, I want to be prepared to enable such a thing with FDW.
 

* The number of records to insert is not known during planning (in general), so it feels natural to get prepared during
execution phase, or not unnatural at least.
 

* I wanted to avoid the overhead of building the full query string for a 100-record insert statement during query
planning, because it may be a bit costly for usual 1-record inserts.  (The overhead may be hidden behind the high
communication cost of postgres_fdw, though.)
 

So, in terms of code cleanness, how about moving my code for rebuilding the query string from execute_foreign_modify() to
some new function in deparse.c?
 


> 2) I think the GUC should be replaced with an server/table option, similar to
> fetch_size.

Hmm, batch_size differs from fetch_size.  fetch_size is a postgres_fdw-specific feature with no relevant FDW routine,
while batch_size is a configuration parameter for all FDWs that implement ExecForeignBulkInsert().  The ideas I can
think of are:
 

1. Follow JDBC/ODBC and add standard FDW properties.  For example, the JDBC standard defines standard connection pool
properties such as maxPoolSize and minPoolSize.  JDBC drivers have to provide them with those defined names.  Likewise,
the FDW interface requires FDW implementors to handle the foreign server option name "max_bulk_insert_tuples" if he/she
wants to provide the bulk insert feature and implement ExecForeignBulkInsert().  The core executor gets that setting from
the FDW by calling a new FDW routine like GetMaxBulkInsertTuples().  Sigh...
 

2. Add a new max_bulk_insert_tuples reloption to CREATE/ALTER FOREIGN TABLE.  The executor gets the value from Relation and
uses it.  (But is this a table-specific configuration?  I don't think so, sigh...)
 

3. Adopt the current USERSET GUC max_bulk_insert_tuples.  I think this is enough because the user can change the
setting per session, application, and database.
 



Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 11/19/20 3:43 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> Unfortunately, this does not compile for me, because
>> nodeModifyTable calls ExecGetTouchedPartitions, which is not
>> defined anywhere. Not sure what's that about, so I simply
>> commented-out this. That probably fails the partitioned cases, but
>> it allowed me to do some review and testing.
> 
> Ouch, sorry.  I'm ashamed to have forgotten including
> execPartition.c.
> 

No reason to feel ashamed. Mistakes do happen from time to time.

> 
>> The are a couple other smaller changes. E.g. it undoes changes to 
>> finish_foreign_modify, and instead calls separate functions to
>> prepare the bulk statement. It also adds list_make5/list_make6
>> macros, so as to not have to do strange stuff with the parameter
>> lists.
> 
> Thanks, I'll take them thankfully!  I wonder why I didn't think of
> separating deallocate_query() from finish_foreign_modify() ...
> perhaps my brain was dying.  As for list_make5/6(), I saw your first
> patch avoid adding them, so I thought you found them ugly (and I felt
> so, too.)  But thinking now, there's no reason to hesitate it.
> 

I think it's often easier to look at changes like deallocate_query with a
bit of distance, not while hacking on the patch and just trying to make
it work somehow.

For the list_make# stuff, I think I've decided to do the simplest thing
possible in the extension, without having to recompile the server. But I
think for a proper patch it's better to keep it more readable.

> ...
> 
>> 1) As I mentioned before, I really don't think we should be doing
>> deparsing in execute_foreign_modify - that's something that should
>> happen earlier, and should be in a deparse.c function.
> ...
>> The attached patch tries to address both of these points.
>> 
>> Firstly, it adds a new deparseBulkInsertSql function, that builds a
>> query for the "full" batch, and then uses those two queries - when
>> we get a full batch we use the bulk query, otherwise we use the
>> single-row query in a loop. IMO this is cleaner than deparsing
>> queries ad hoc in the execute_foreign_modify.
> ...
>> Of course, this might be worse when we don't have a full batch,
>> e.g. for a query that insert only 50 rows with batch_size=100. If
>> this case is common, one option would be lowering the batch_size
>> accordingly. If we really want to improve this case too, I suggest
>> we pass more info than just a position of the VALUES clause - that
>> seems a bit too hackish.
> ...
>> Secondly, it adds the batch_size option to server/foreign table,
>> and uses that. This is not complete, though.
>> postgresPlanForeignModify currently passes a hard-coded value at
>> the moment, it needs to lookup the correct value for the 
>> server/table from RelOptInfo or something. And I suppose
>> ModifyTable inftractructure will need to determine the value in
>> order to pass the correct number of slots to the FDW API.
> 
> I can sort of understand your feeling, but I'd like to reconstruct
> the query and prepare it in execute_foreign_modify() because:
> 
> * Some of our customers use bulk insert in ECPG (INSERT ...
> VALUES(record1, (record2), ...) to insert variable number of records
> per query.  (Oracle's Pro*C has such a feature.)  So, I want to be
> prepared to enable such a thing with FDW.
> 
> * The number of records to insert is not known during planning (in
> general), so it feels natural to get prepared during execution phase,
> or not unnatural at least.
> 

I think we should differentiate between "deparsing" and "preparing".

> * I wanted to avoid the overhead of building the full query string
> for 100-record insert statement during query planning, because it may
> be a bit costly for usual 1-record inserts.  (The overhead may be
> hidden behind the high communication cost of postgres_fdw, though.)
> 

Hmm, ok. I haven't tried how expensive that would be, but my assumption
was it's much cheaper than the latency we save. But maybe I'm wrong.

> So, in terms of code cleanness, how about moving my code for
> rebuilding query string from execute_foreign_modify() to some new
> function in deparse.c?
> 

That might work, yeah. I suggest we do this:

1) try to use the same approach for both single-row inserts and larger
batches, to not have a lot of different branches

2) modify deparseInsertSql to produce not the "final" query but some
intermediate representation useful to generate queries inserting
arbitrary number of rows

3) in execute_foreign_modify remember the last number of rows, and only
rebuild/replan the query when it changes

> 
>> 2) I think the GUC should be replaced with an server/table option,
>> similar to fetch_size.
> 
> Hmm, batch_size differs from fetch_size.  fetch_size is a
> postgres_fdw-specific feature with no relevant FDW routine, while
> batch_size is a configuration parameter for all FDWs that implement
> ExecForeignBulkInsert().  The ideas I can think of are:
> 
> 1. Follow JDBC/ODBC and add standard FDW properties.  For example,
> the JDBC standard defines standard connection pool properties such as
> maxPoolSize and minPoolSize.  JDBC drivers have to provide them with
> those defined names.  Likewise, the FDW interface requires FDW
> implementors to handle the foreign server option name
> "max_bulk_insert_tuples" if he/she wants to provide bulk insert
> feature and implement ExecForeignBulkInsert().  The core executor
> gets that setting from the FDW by calling a new FDW routine like
> GetMaxBulkInsertTuples().  Sigh...
> 
> 2. Add a new max_bulk_insert_tuples reloption to CREATE/ALTER FOREIGN
> TABLE.  executor gets the value from Relation and uses it.  (But is
> this a table-specific configuration?  I don't think so, sigh...)
> 

I do agree there's a difference between fetch_size and batch_size. For
fetch_size, it's internal to postgres_fdw - no external code needs to
know about it. For batch_size that's not the case, the ModifyTable core
code needs to be aware of that.

That means the "batch_size" is becoming part of the API, and IMO the way
to do that is by exposing it as an explicit API method. So +1 to add
something like GetMaxBulkInsertTuples.

It still needs to be configurable at the server/table level, though. The
new API method should only inform ModifyTable about the final max batch
size the FDW decided to use.
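
For example, the callback might have roughly this shape (an illustrative sketch of the proposed API, not a final definition; the struct layout and names are made up):

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * The core executor asks the FDW once how many rows it will accept per
 * bulk-insert call; the FDW answers from the batch_size it resolved from
 * the foreign table/server options.
 */
typedef int (*GetModifyBatchSize_function) (ResultRelInfo *resultRelInfo);

typedef struct PgFdwModifyBatchState
{
	int			batch_size;		/* resolved from table/server options */
} PgFdwModifyBatchState;

static int
postgresGetModifyBatchSize(ResultRelInfo *resultRelInfo)
{
	PgFdwModifyBatchState *fmstate =
		(PgFdwModifyBatchState *) resultRelInfo->ri_FdwState;

	return fmstate->batch_size;
}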

> 3. Adopt the current USERSET GUC max_bulk_insert_tuples.  I think
> this is enough because the user can change the setting per session,
> application, and database.
> 

I don't think this is usable in practice, because a single session may
be using multiple FDW servers, with different implementations, latency
to the data nodes, etc. It's unlikely a single GUC value will be
suitable for all of them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> I don't think this is usable in practice, because a single session may
> be using multiple FDW servers, with different implementations, latency
> to the data nodes, etc. It's unlikely a single GUC value will be
> suitable for all of them.

That makes sense.  The row size varies from table to table, so the user may want to tune this option to reduce memory
consumption.

I think the attached patch has reflected all your comments.  I hope this will pass..


Regards
Takayuki Tsunakawa




Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 11/23/20 3:17 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> I don't think this is usable in practice, because a single session
>> may be using multiple FDW servers, with different implementations,
>> latency to the data nodes, etc. It's unlikely a single GUC value
>> will be suitable for all of them.
> 
> That makes sense.  The row size varies from table to table, so the
> user may want to tune this option to reduce memory consumption.
> 
> I think the attached patch has reflected all your comments.  I hope
> this will pass..
> 

Thanks - I didn't have time for a thorough review at the moment, so I
only skimmed through the diff and did a couple of very simple tests. And I
think overall it looks quite nice.

A couple minor comments/questions:

1) We're calling it "batch_size" but the API function is named
postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
to postgresGetModifyBatchSize()? That has the advantage it'd work if we
ever add support for batching to UPDATE/DELETE.

2) Do we have to look up the batch_size in create_foreign_modify (in
server/table options)? I'd have expected to look it up while planning
the modify and then pass it through the list, just like the other
FdwModifyPrivateIndex stuff. But maybe that's not possible.

3) That reminds me - should we show the batching info on EXPLAIN? That
seems like a fairly interesting thing to show to the user. Perhaps
showing the average batch size would also be useful? Or maybe not, we
create the batches as large as possible, with the last one smaller.

4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
over for every tuple. I don't know if that has a measurable impact, but it
seems a bit excessive IMO. I don't think we should support the batch
size changing during execution (seems tricky).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> 1) We're calling it "batch_size" but the API function is named
> postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
> to postgresGetModifyBatchSize()? That has the advantage it'd work if we
> ever add support for batching to UPDATE/DELETE.

Actually, I was in two minds whether the term batch or bulk is better.  Because Oracle uses "bulk insert" and "bulk
fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch as in "batch
updates" and its API method names (addBatch, executeBatch).
 

But it seems better, or at least more common, to use "batch", according to the etymology and the following Stack Overflow page:

https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch

OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better.  That is, this API
deals with multiple records in a single INSERT statement.  Your GetModifyBatchSize will be reserved for statement
batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the
following JDBC batch updates.  What do you think?
 

CODE EXAMPLE 14-1 Creating and executing a batch of insert statements 
--------------------------------------------------
Statement stmt = con.createStatement(); 
stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')"); 
stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')"); 
stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)"); 

// submit a batch of update commands for execution 
int[] updateCounts = stmt.executeBatch(); 
--------------------------------------------------


> 2) Do we have to lookup the batch_size in create_foreign_modify (in
> server/table options)? I'd have expected to look it up while planning
> the modify and then pass it through the list, just like the other
> FdwModifyPrivateIndex stuff. But maybe that's not possible.

Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning.  Unfortunately, it's also
called from BeginForeignInsert(), but other stuff passed to create_foreign_modify() including the query string is
constructed there.
 


> 3) That reminds me - should we show the batching info on EXPLAIN? That
> seems like a fairly interesting thing to show to the user. Perhaps
> showing the average batch size would also be useful? Or maybe not, we
> create the batches as large as possible, with the last one smaller.

Hmm, maybe batch_size is not for EXPLAIN because its value doesn't change dynamically based on the planning or system
state unlike shared buffers and parallel workers.  OTOH, I sometimes want to see what configuration parameter values the
user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and auto_explain).  For
example, it'd be nice if EXPLAIN (parameters on) could do that.  Some relevant FDW-related parameters could be included
in that output.
 

> 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
> over for every tuple. I don't know it that has measurable impact, but it
> seems a bit excessive IMO. I don't think we should support the batch
> size changing during execution (seems tricky).

Don't worry about this, too.  GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in
create_foreign_modify().


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> 1) We're calling it "batch_size" but the API function is named
>> postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
>> to postgresGetModifyBatchSize()? That has the advantage it'd work if we
>> ever add support for batching to UPDATE/DELETE.
> 
> Actually, I was in two minds whether the term batch or bulk is better.  Because Oracle uses "bulk insert" and "bulk
fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch as in "batch
updates" and its API method names (addBatch, executeBatch).
 
> 
> But it seems better or common to use batch according to the etymology and the following Stack Overflow page:
> 
> https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch
> 
> OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better.  That is, this API
deals with multiple records in a single INSERT statement.  Your GetModifyBatchSize will be reserved for statement
batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the
following JDBC batch updates.  What do you think?
 
> 

I don't know. I was really only thinking about batching in the context
of a single DML command, not about batching of multiple commands at the
protocol level. IMHO it's far more likely we'll add support for batching
for DELETE/UPDATE than libpq pipelining, which seems rather different
from how the FDW API works. Which is why I was suggesting to use a name
that would work for all DML commands, not just for inserts.

> CODE EXAMPLE 14-1 Creating and executing a batch of insert statements 
> --------------------------------------------------
> Statement stmt = con.createStatement(); 
> stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')"); 
> stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')"); 
> stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)"); 
> 
> // submit a batch of update commands for execution 
> int[] updateCounts = stmt.executeBatch(); 
> --------------------------------------------------
> 

Sure. We already have a patch to support something like this at the
libpq level, IIRC. But I'm not sure how well that matches the FDW API
approach in general.

> 
>> 2) Do we have to lookup the batch_size in create_foreign_modify (in
>> server/table options)? I'd have expected to look it up while planning
>> the modify and then pass it through the list, just like the other
>> FdwModifyPrivateIndex stuff. But maybe that's not possible.
> 
> Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning.  Unfortunately, it's also
called from BeginForeignInsert(), but other stuff passed to create_foreign_modify() including the query string is
constructed there.
 
> 

Hmm, ok.

> 
>> 3) That reminds me - should we show the batching info on EXPLAIN? That
>> seems like a fairly interesting thing to show to the user. Perhaps
>> showing the average batch size would also be useful? Or maybe not, we
>> create the batches as large as possible, with the last one smaller.
> 
> Hmm, maybe batch_size is not for EXPLAIN because its value doesn't change dynamically based on the planning or system
state unlike shared buffers and parallel workers.  OTOH, I sometimes want to see what configuration parameter values the
user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and auto_explain).  For
example, it'd be nice if EXPLAIN (parameters on) could do that.  Some relevant FDW-related parameters could be included
in that output.
 
> 

Not sure, but I'd guess knowing whether batching is used would be
useful. We only print the single-row SQL query, which kinda gives the
impression that there's no batching.

>> 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
>> over for every tuple. I don't know it that has measurable impact, but it
>> seems a bit excessive IMO. I don't think we should support the batch
>> size changing during execution (seems tricky).
> 
> Don't worry about this, too.  GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in
create_foreign_modify().
> 

Well, I do worry for two reasons.

Firstly, the fact that in postgres_fdw the call is cheap does not mean
it'll be like that in every other FDW. Presumably, the other FDWs might
cache it in the struct and do the same thing, of course.

But the fact that we're calling it over and over for each row kinda
seems like we allow the value to change during execution, but I very
much doubt the code is expecting that. I haven't tried, but assume the
function first returns 10 and then 100. ISTM the code will allocate
ri_Slots with 25 slots, but then we'll try stashing 100 tuples there.
That can't end well. Sure, we can claim it's a bug in the FDW extension,
but it's also due to the API design.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Craig Ringer
Date:
On Thu, Oct 8, 2020 at 10:40 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:

Thank you for picking up this.  I'm interested in this topic, too.  (As an aside, we'd like to submit a bulk insert patch for ECPG in the near future.)

As others referred, Andrey-san's fast COPY to foreign partitions is also promising.  But I think your bulk INSERT is a separate feature and offers COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS SELECT.

Is there anything that makes you worry and stops development?  Could I give it a try to implement this (I'm not sure I can, sorry.  I'm worried if we can change the executor's call chain easily.)


I suggest that when developing this, you keep in mind the ongoing work on the libpq pipelining/batching enhancements, and also the way many interfaces to foreign data sources support asynchronous, concurrent operations.

Best results with postgres_fdw insert batching would be achieved if it can also send its batches as asynchronous queries and only block when it's required to report on the results of the work. This will also be true of any other FDW where the backing remote interface can support asynchronous concurrent or pipelined operation.

I'd argue it's pretty much vital for decent performance when talking to a cloud database from an on-prem server for example, or any other time that round-trip-time reduction is important.

The most important characteristic of an FDW API to permit this would be decoupling of request and response into separate non-blocking calls that don't have to occur in ordered pairs. Instead of "insert_foo(foo) -> insert_result", have "queue_insert_foo(foo) -> future_result", "get_result_if_available(future_result) -> maybe result" and "get_result_blocking(future_result) -> result". Permit multiple queue_insert_foo(...)s without a/b interleaving with result fetches being required.
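
In C-ish terms, the shape I have in mind is something like this sketch (all type and function names are purely hypothetical, just to show the request/response split):

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical decoupled (non-blocking) insert interface: the request and
 * the response are separate calls, so several inserts can be in flight at
 * once instead of strictly alternating request/response pairs.
 */
typedef struct FdwInsertFuture FdwInsertFuture;	/* opaque pending request */
typedef struct FdwInsertResult FdwInsertResult;	/* outcome, e.g. row count */

/* Queue an insert without waiting for the remote side to answer. */
extern FdwInsertFuture *queue_insert(ResultRelInfo *rri, TupleTableSlot *slot);

/* Poll: return the result if it has arrived, or NULL if still pending. */
extern FdwInsertResult *get_result_if_available(FdwInsertFuture *future);

/* Block only at the point where the executor actually needs the outcome. */
extern FdwInsertResult *get_result_blocking(FdwInsertFuture *future);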

Ideally it'd be able to accumulate small batches of inserts locally and send a batch to the remote end once it's accumulated enough. But instead of blocking waiting for the result, return control to the executor after sending, without forcing a socket flush (which might block) and without waiting to learn what the outcome was. Allow new batches to be accumulated and sent before the results of the first batch are received, so long as it's within the same executor node so we don't make any unfortunate mistakes with mixing things up in compound statements or functions etc. Only report outcomes like rowcounts lazily when results are received, or when required to do so.

If now we have

REQUEST -> [block] -> RESULT
~~ round trip delay ~~
REQUEST -> [block] -> RESULT
~~ round trip delay ~~
REQUEST -> [block] -> RESULT
~~ round trip delay ~~
REQUEST -> [block] -> RESULT

and batching would give us

{ REQUEST, REQUEST} -> [block] -> { RESULT, RESULT }
~~ round trip delay ~~
{ REQUEST, REQUEST} -> [block] -> { RESULT, RESULT }

consider if room can be left in the batching API to permit:

{ REQUEST, REQUEST} -> [nonblocking send...]
{ REQUEST, REQUEST} -> [nonblocking send...]
~~ round trip delay ~~
[....] -> RESULT, RESULT
[....] -> RESULT, RESULT


... where we only actually block at the point where the result is required as input into the next node.

Honestly I don't know the executor structure well enough to say if this is even remotely feasible right now. Maybe Andres can comment. But please keep it in mind if you're thinking of making FDW API changes.

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote:
> > OTOH, as for the name GetModifyBatchSize() you suggest, I think
> GetInsertBatchSize may be better.  That is, this API deals with multiple
> records in a single INSERT statement.  Your GetModifyBatchSize will be
> reserved for statement batching when libpq has supported batch/pipelining to
> execute multiple INSERT/UPDATE/DELETE statements, as in the following
> JDBC batch updates.  What do you think?
> >
> 
> I don't know. I was really only thinking about batching in the context
> of a single DML command, not about batching of multiple commands at the
> protocol level. IMHO it's far more likely we'll add support for batching
> for DELETE/UPDATE than libpq pipelining, which seems rather different
> from how the FDW API works. Which is why I was suggesting to use a name
> that would work for all DML commands, not just for inserts.

Right, I can't imagine now how the interaction among the client, server core and FDWs would be regarding the statement
batching. So I'll take your suggested name.
 


> Not sure, but I'd guess knowing whether batching is used would be
> useful. We only print the single-row SQL query, which kinda gives the
> impression that there's no batching.

Added in postgres_fdw like "Remote SQL" when EXPLAIN VERBOSE is run.


> > Don't worry about this, too.  GetMaxBulkInsertTuples() just returns a value
> that was already saved in a struct in create_foreign_modify().
> >
> 
> Well, I do worry for two reasons.
> 
> Firstly, the fact that in postgres_fdw the call is cheap does not mean
> it'll be like that in every other FDW. Presumably, the other FDWs might
> cache it in the struct and do the same thing, of course.
> 
> But the fact that we're calling it over and over for each row kinda
> seems like we allow the value to change during execution, but I very
> much doubt the code is expecting that. I haven't tried, but assume the
> function first returns 10 and then 100. ISTM the code will allocate
> ri_Slots with 25 slots, but then we'll try stashing 100 tuples there.
> That can't end well. Sure, we can claim it's a bug in the FDW extension,
> but it's also due to the API design.

You worried about other FDWs than postgres_fdw.  That's reasonable.  I insisted in other threads that PG developers
care only about postgres_fdw, not other FDWs, when designing the FDW interface, but I myself made the same mistake.  I
made changes so that the executor calls GetModifyBatchSize() once per relation per statement.
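
Roughly, the executor-side change amounts to something like the following sketch (ri_BatchSize is an illustrative field name and GetModifyBatchSize refers to the callback discussed above; this is not the patch itself):

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Resolve the FDW's batch size once per result relation and cache it,
 * instead of asking the FDW again for every row.
 */
static int
get_cached_batch_size(ResultRelInfo *rri)
{
	if (rri->ri_BatchSize == 0)
	{
		if (rri->ri_FdwRoutine != NULL &&
			rri->ri_FdwRoutine->GetModifyBatchSize != NULL)
			rri->ri_BatchSize = rri->ri_FdwRoutine->GetModifyBatchSize(rri);
		else
			rri->ri_BatchSize = 1;	/* batching not supported */
	}

	return rri->ri_BatchSize;
}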
 


Regards
Takayuki Tsunakawa


Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Craig Ringer <craig.ringer@enterprisedb.com> 
> I suggest that when developing this, you keep in mind the ongoing work on the libpq pipelining/batching enhancements,
and also the way many interfaces to foreign data sources support asynchronous, concurrent operations.
 

Yes, thank you, I bear it in mind.  I understand it's a feature for batching multiple kinds of SQL statements like
JDBC's batch updates.
 


> I'd argue it's pretty much vital for decent performance when talking to a cloud database from an on-prem server for
example, or any other time that round-trip-time reduction is important.
 

Yeah, I'm thinking of the data migration and integration as the prominent use case.


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 11/25/20 7:31 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Craig Ringer <craig.ringer@enterprisedb.com>
>> I suggest that when developing this, you keep in mind the ongoing
>> work on the libpq pipelining/batching enhancements, and also the
>> way many interfaces to foreign data sources support asynchronous,
>> concurrent operations.
> 
> Yes, thank you, I bear it in mind.  I understand it's a feature for
> batching multiple kinds of SQL statements like DBC's batch updates.
> 

I haven't followed the libpq pipelining thread very closely. It does
seem related, but I'm not sure if it's a good match for this patch, or
how far is it from being committable ...

> 
>> I'd argue it's pretty much vital for decent performance when
>> talking to a cloud database from an on-prem server for example, or
>> any other time that round-trip-time reduction is important.
> 
> Yeah, I'm thinking of the data migration and integration as the
> prominent use case.
> 

Well, good that we all agree this is a useful feature to have (in
general). The question is whether postgres_fdw should be doing batching
on its own (per this thread) or rely on some other feature (libpq
pipelining). I haven't followed the other thread, so I don't have an
opinion on that.

Note however we're doing two things here, actually - we're implementing
custom batching for postgres_fdw, but we're also extending the FDW API
to allow other implementations do the same thing. And most of them won't
be able to rely on the connection library providing that, I believe.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Well, good that we all agree this is a useful feature to have (in
> general). The question is whether postgres_fdw should be doing batching
> on it's onw (per this thread) or rely on some other feature (libpq
> pipelining). I haven't followed the other thread, so I don't have an
> opinion on that.

Well, as someone said in this thread, I think bulk insert is much more common than updates/deletes.  Thus, major DBMSs
have INSERT VALUES(record1), (record2)... and INSERT SELECT.  Oracle has direct path INSERT in addition.  As for the
comparison of INSERT with multiple records and libpq batching (= multiple INSERTs), I think the former is more efficient
because the amount of data transfer is less and the parsing-planning of INSERT for each record is eliminated.
 

I never deny the usefulness of libpq batch/pipelining, but I'm not sure if app developers would really use it.  If they
want to reduce the client-server round-trips, won't they use traditional stored procedures?  Yes, the stored procedure
language is very DBMS-specific.  Then, I'd like to know what kind of well-known applications are using standard batching
API like JDBC's batch updates.  (Sorry, I think that should be discussed in the libpq batch/pipelining thread and this
thread should not be polluted.)
 


> Note however we're doing two things here, actually - we're implementing
> custom batching for postgres_fdw, but we're also extending the FDW API
> to allow other implementations do the same thing. And most of them won't
> be able to rely on the connection library providing that, I believe.

I'm afraid so, too.  Then, postgres_fdw would be an example that other FDW developers would look at when they use
INSERT with multiple records.
 


Regards
Takayuki Tsunakawa





Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 11/26/20 2:48 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> Well, good that we all agree this is a useful feature to have (in 
>> general). The question is whether postgres_fdw should be doing 
>> batching on it's onw (per this thread) or rely on some other 
>> feature (libpq pipelining). I haven't followed the other thread,
>> so I don't have an opinion on that.
> 
> Well, as someone said in this thread, I think bulk insert is much 
> more common than updates/deletes.  Thus, major DBMSs have INSERT 
> VALUES(record1), (record2)... and INSERT SELECT.  Oracle has direct 
> path INSERT in addition.  As for the comparison of INSERT with 
> multiple records and libpq batching (= multiple INSERTs), I think
> the former is more efficient because the amount of data transfer is
> less and the parsing-planning of INSERT for each record is
> eliminated.
> 
> I never deny the usefulness of libpq batch/pipelining, but I'm not 
> sure if app developers would really use it.  If they want to reduce 
> the client-server round-trips, won't they use traditional stored 
> procedures?  Yes, the stored procedure language is very 
> DBMS-specific.  Then, I'd like to know what kind of well-known 
> applications are using standard batching API like JDBC's batch 
> updates.  (Sorry, I think that should be discussed in libpq 
> batch/pipelining thread and this thread should not be polluted.)
> 

Not sure how this is related to app developers? I think the idea was
that the libpq features might be useful between the two PostgreSQL
instances. I.e. the postgres_fdw would use the libpq batching to send
chunks of data to the other side.

> 
>> Note however we're doing two things here, actually - we're 
>> implementing custom batching for postgres_fdw, but we're also 
>> extending the FDW API to allow other implementations do the same 
>> thing. And most of them won't be able to rely on the connection 
>> library providing that, I believe.
> 
> I'm afraid so, too.  Then, postgres_fdw would be an example that 
> other FDW developers would look at when they use INSERT with
> multiple records.
> 

Well, my point was that we could keep the API, but maybe it should be
implemented using the proposed libpq batching. They could still use the
postgres_fdw example of how to use the API, but the internals would need
to be different, of course.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Craig Ringer
Date:
On Fri, Nov 27, 2020 at 3:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
 
Not sure how is this related to app developers? I think the idea was
that the libpq features might be useful between the two PostgreSQL
instances. I.e. the postgres_fdw would use the libpq batching to send
chunks of data to the other side.

Right. Or at least, when designing the FDW API, do so in a way that doesn't strictly enforce Request/Response alternation without interleaving, so you can benefit from it in the future.

It's hardly just libpq after all. A *lot* of client libraries and drivers will be capable of non-blocking reads or writes with multiple ones in flight at once. Any REST-like API generally can, for example. So for performance reasons we should, if possible, avoid baking in the assumption that a request cannot be made until the response from the previous request is received, and instead have a wait interface to use for when a new request requires the prior response's result before it can proceed.

Well, my point was that we could keep the API, but maybe it should be
implemented using the proposed libpq batching. They could still use the
postgres_fdw example how to use the API, but the internals would need to
be different, of course.

Sure. Or just allow room for it in the FDW API, though using the pipelining support natively would be nice.

If the FDW interface allows Pg to

Insert A
Insert B
Insert C
Wait for outcome of insert A
...

then that'll be useful for using libpq pipelining, but also FDWs for all sorts of other DBs, especially cloud-y ones where latency is a big concern.
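Just to illustrate the shape I mean, a queue/wait split could look roughly like the following.  None of these callbacks exist today; the struct and function names are made up purely for the example:

    #include "postgres.h"
    #include "nodes/execnodes.h"

    /*
     * Purely hypothetical -- no such callbacks exist in the FDW API.  The
     * point is only the split between "queue work without waiting" and
     * "wait for the remote results", which multi-row batching, libpq
     * pipelining, or an async REST client could all implement underneath.
     */
    typedef struct FdwAsyncModifyRoutine
    {
        /* queue one row for insertion, without waiting for the remote result */
        void        (*QueueForeignInsert) (ResultRelInfo *rinfo,
                                           TupleTableSlot *slot);

        /* block until everything queued so far is acknowledged remotely */
        void        (*WaitForeignInsertResults) (ResultRelInfo *rinfo);
    } FdwAsyncModifyRoutine;

The batch insert proposed in this thread maps onto that shape trivially: queue N rows, then wait once.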


RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Not sure how is this related to app developers? I think the idea was
> that the libpq features might be useful between the two PostgreSQL
> instances. I.e. the postgres_fdw would use the libpq batching to send
> chunks of data to the other side.

> Well, my point was that we could keep the API, but maybe it should be
> implemented using the proposed libpq batching. They could still use the
> postgres_fdw example how to use the API, but the internals would need to
> be different, of course.

Yes, I understand them.  I just wondered if app developers use the statement batching API for libpq or JDBC in what
kind of apps.  That is, I talked about the batching API itself, not related to FDW.  (So, I mentioned I think I should
ask such a question in the libpq batching thread.)
 

I expect postgresExecForeignBatchInsert() would be able to use the libpq batching API, because it receives an array of
tuples and can generate and issue an INSERT statement for each tuple.  But I'm not sure either if the libpq batching is
likely to be committed in the near future.  (The thread looks too long...)  Anyway, this thread's batch insert can make
progress (and hopefully be committed), and once the libpq batching has been committed, we can give it a try and
modify postgres_fdw to see if we can get a further performance boost.
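For context, the batch callback being discussed has roughly this shape in later versions of the patch (shown only as a reference; the patch itself is the authoritative definition):

    /*
     * Shape of the proposed batch-insert callback.  The executor hands the
     * FDW up to *numSlots rows at once, and the FDW reports back how many
     * it actually inserted.
     */
    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                           ResultRelInfo *rinfo,
                           TupleTableSlot **slots,
                           TupleTableSlot **planSlots,
                           int *numSlots);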
 


Regards
Takayuki Tsunakawa



Re: POC: postgres_fdw insert batching

From
Craig Ringer
Date:
On Fri, Nov 27, 2020 at 10:47 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:

Covering this one first:

I expect postgresExecForeignBatchInsert() would be able to use the libpq batching API, because it receives an array of tuples and can generate and issue INSERT statement for each tuple.

Sure, you can generate big multi-inserts. Or even do a COPY. But you still have to block for a full round-trip until the foreign server replies. So if you have 6000 calls to postgresExecForeignBatchInsert() during a single query, and a 100ms round trip time to the foreign server, you're going to waste 6000*0.1 = 600s = 10 min blocked in  postgresExecForeignBatchInsert() waiting for results from the foreign server.

Such batches have some major downsides:

* The foreign server cannot start executing the first query in the batch until the last query in the batch has been accumulated and the whole batch has been sent to the foreign server;
* The FDW has to block waiting for the batch to execute on the foreign server and for a full network round-trip before it can start another batch or let the backend do other work.

This means RTTs get multiplied by batch counts. Still a lot better than individual statements, but plenty slow for high latency connections.

* Prepare 1000 rows to insert [10ms]
* INSERT 1000 values [100ms RTT + 50ms foreign server execution time]
* Prepare 1000 rows to insert [10ms]
* INSERT 1000 values [100ms RTT + 50ms foreign server execution time]
* ...

If you can instead send new inserts (or sets of inserts) to the foreign server without having to wait for the result of the previous batch to arrive, you can spend 100ms total waiting for results instead of 10 mins. You can start the execution of the first query earlier, spend less time blocked waiting on network, and let the local backend continue doing other work while the foreign server is busy executing the statements.

The time spent preparing local rows to insert now overlaps with the RTT and remote execution time, instead of happening serially. And there only has to be one RTT wait, assuming the foreign server and network can keep up with the rate we are generating requests at.
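To make that flow concrete, a minimal sketch against a libpq-style pipeline API might look as follows.  Treat the function names (PQenterPipelineMode, PQpipelineSync, ...) as illustrative, since the libpq patch is still evolving, and sql[] stands in for already-deparsed INSERT statements:

    #include <libpq-fe.h>

    /*
     * Sketch only: send "n" already-deparsed INSERT statements down one
     * pipeline, then collect all the results after a single synchronization
     * point instead of paying one round trip per statement.
     */
    static void
    pipeline_inserts(PGconn *conn, const char *const *sql, int n)
    {
        PQenterPipelineMode(conn);

        for (int i = 0; i < n; i++)
            PQsendQueryParams(conn, sql[i], 0, NULL, NULL, NULL, NULL, 0);

        PQpipelineSync(conn);           /* flush; the results now stream back */

        for (int i = 0; i < n; i++)
        {
            PGresult *res = PQgetResult(conn); /* result of the i-th INSERT */

            /* real code would check PQresultStatus(res) == PGRES_COMMAND_OK */
            PQclear(res);
            (void) PQgetResult(conn);   /* NULL terminator for this query */
        }

        PQclear(PQgetResult(conn));     /* the pipeline-sync result */
        PQexitPipelineMode(conn);
    }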

I can throw together some diagrams if it'll help. But in the libpq pipelining patch I demonstrated a 300 times (3000%) performance improvement on a test workload...

Anyway, this thread's batch insert can be progressed (and hopefully committed), and once the libpq batching has been committed, we can give it a try to use it and modify postgres_fdw to see if we can get further performance boost.

My point is that you should seriously consider whether batching is the appropriate interface here, or whether the FDW can expose a pipeline-like "queue work" then "wait for results" interface. That can be used to implement batching exactly as currently proposed, it does not have to wait for any libpq pipelining features. But it can *also* be used to implement concurrent async requests in other FDWs, and to implement pipelining in postgres_fdw once the needed libpq support is available.

I don't know the FDW to postgres API well enough, and it's possible I'm talking entirely out of my hat here.
 
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Not sure how is this related to app developers? I think the idea was
> that the libpq features might be useful between the two PostgreSQL
> instances. I.e. the postgres_fdw would use the libpq batching to send
> chunks of data to the other side.

> Well, my point was that we could keep the API, but maybe it should be
> implemented using the proposed libpq batching. They could still use the
> postgres_fdw example how to use the API, but the internals would need to
> be different, of course.

Yes, I understand them.  I just wondered if app developers use the statement batching API for libpq or JDBC in what kind of apps.

For JDBC, yes, it's used very heavily and has been for a long time, because PgJDBC doesn't rely on libpq - it implements the protocol directly and isn't bound by libpq's limitations. The application interface for it in JDBC is a batch interface [1][2], not a pipelined interface, so that's what PgJDBC users interact with [3] but batch execution is implemented using protocol pipelining support inside PgJDBC [4]. A while ago I did some work on deadlock prevention to work around issues with PgJDBC's implementation [5] which was needed because the feature was so heavily used. Both were to address customer needs in real world applications. The latter increased application performance over 50x through round-trip elimination.

For libpq, no, batching and pipelining are not yet used by anybody because application authors have to write to the libpq API and there hasn't been any in-core support for batching. We've had async / non-blocking support for a while, but it still enforces strict request/response ordering without interleaving, so application authors cannot make use of the same postgres server and protocol capabilities as PgJDBC. Most other drivers (like psqlODBC and psycopg2) are implemented on top of libpq, so they inherit the same limitations.

I don't expect most application authors to adopt pipelining directly, mainly because hardly anyone writes application code against libpq anyway. But drivers written on top of libpq will be able to adopt it to expose the batching, pipeline, or async/callback/event driven interfaces supported by their client database language interface specifications, or expose their own extension interfaces to give users callback-driven or batched query capabilities. In particular, psqlODBC will be able to implement ODBC batch query [6] efficiently. Right now psqlODBC can't execute batches efficiently via libpq, since it must perform one round-trip per query. It will be able to use the libpq pipelining API to greatly reduce round trips.

 
  But I'm not sure either if the libpq batching is likely to be committed in the near future.  (The thread looks too long...) 

I think it's getting there tbh.




Regards
Takayuki Tsunakawa


RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Craig Ringer <craig.ringer@enterprisedb.com> 
> But in the libpq pipelining patch I demonstrated a 300 times (3000%) performance improvement on a test workload...

Wow, impressive number.  I've just seen it in the beginning of the libpq pipelining thread (oh, already four years
ago..!)  Could you share the workload and the network latency (ping time)?  I'm sorry if I'm just overlooking it.
 

Thank you for your (always) concise explanation.  I'd like to check other DBMSs and your rich references for the FDW
interface.  (My first intuition is that many major DBMSs might not have client C APIs that can be used to implement an
async pipelining FDW interface.  Also, I'm afraid it requires major surgery or reform of executor.  I don't want it to
delay the release of reasonably good (10x) improvement with the synchronous interface.)
 

(It'd be kind of you to send emails in text format.  I've changed the format of this reply from HTML to text.)


Regards
Takayuki Tsunakawa
 

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 11/27/20 7:05 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Craig Ringer <craig.ringer@enterprisedb.com>
>> But in the libpq pipelining patch I demonstrated a 300 times
>> (3000%) performance improvement on a test workload...
> 
> Wow, impressive  number.  I've just seen it in the beginning of the
> libpq pipelining thread (oh, already four years ago..!)  Could you
> share the workload and the network latency (ping time)?  I'm sorry
> I'm just overlooking it.
> 
> Thank you for your (always) concise explanation.  I'd like to check
> other DBMSs and your rich references for the FDW interface.  (My
> first intuition is that many major DBMSs might not have client C APIs
> that can be used to implement an async pipelining FDW interface.
> Also, I'm afraid it requires major surgery or reform of executor.  I
> don't want it to delay the release of reasonably good (10x)
> improvement with the synchronous interface.)
> 

I do agree that pipelining is nice, and can bring huge improvements.

However, the FDW interface as it's implemented today is not designed to
allow that, I believe (we pretty much just invoke the FDW callbacks as
if it were a local AM). It assumes the calls are synchronous, and
redesigning it to work in an async way is a much larger/complex patch than
what's being discussed here.

I do think the FDW extension proposed here (adding the bulk-insert
callback) is useful in general, for two reasons: (a) even if most client
libraries support some sort of pipelining, some don't, and (b) I'd bet
it's still more efficient to send one large insert than pipelining many
individual inserts.

That being said, I'm against expanding the scope of this patch to also
require redesign of the whole FDW infrastructure - that would likely
mean no such improvement landing in PG14. If the libpq pipelining patch
seems likely to get committed, we can try using it for the bulk insert
callback (instead of the current multi-value stuff).


> (It'd be kind of you to send emails in text format.  I've changed the
> format of this reply from HTML to text.)
> 

Craig's client is sending messages in both text/plain and text/html. You
probably need to tell your client to prefer the text/plain part over html, somehow.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Craig Ringer
Date:


On Sat, 28 Nov 2020, 10:10 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:


On 11/27/20 7:05 AM, tsunakawa.takay@fujitsu.com wrote:

However, the FDW interface as it's implemented today is not designed to
allow that, I believe (we pretty much just invoke the FWD callbacks as
if it was a local AM). It assumes the calls are synchronous, and
redesigning it to work in async way is a much larger/complex patch than
what's being discussed here.

I do think the FDW extension proposed here (adding the bulk-insert
callback) is useful in general, for two reasons: (a) even if most client
libraries support some sort of pipelining, some don't, and (b) I'd bet
it's still more efficient to send one large insert than pipelining many
individual inserts.

That being said, I'm against expanding the scope of this patch to also
require redesign of the whole FDW infrastructure - that would likely
mean no such improvement landing in PG14. If the libpq pipelining patch
seems likely to get committed, we can try using it for the bulk insert
callback (instead of the current multi-value stuff).

I totally agree on all points. It was not my intent to expand the scope of this significantly and I really don't want to hold it back.

I raised the interface consideration in case it was something easy to accommodate. It's not, so that's done, topic over.

Re: POC: postgres_fdw insert batching

From
Craig Ringer
Date:
On Fri, 27 Nov 2020, 14:06 tsunakawa.takay@fujitsu.com,
<tsunakawa.takay@fujitsu.com> wrote:
> Also, I'm afraid it requires major surgery or reform of executor.  I
> don't want it to delay the release of reasonably good (10x)
> improvement with the synchronous interface.)


Totally sensible. If it isn't feasible without significant executor
change that's all that needs to be said.

I was afraid that'd be the case given the executor's pull flow but
just didn't know enough.

It was not my intention to hold this patch up or greatly expand its
scope. I'll spend some time testing it out and have a closer read soon
to see if I can help progress it.

I know Andres did some initial work on executor parallelism and
pipelining. I should take a look.

> > But in the libpq pipelining patch I demonstrated a 300 times (3000%) performance improvement on a test workload...
>
> Wow, impressive number.  I've just seen it in the beginning of the libpq pipelining thread (oh, already four years
> ago..!)

Yikes.

> Could you share the workload and the network latency (ping time)?  I'm sorry I'm just overlooking it.

I thought I gave it at the time, and a demo program. IIRC it was just
doing small multi row inserts or single row inserts. Latency would've
been a couple of hundred ms probably, I think I did something like
running on my laptop (Australia, ADSL) to a server on AWS US or EU.

> Thank you for your (always) concise explanation.

You joke! I am many things but despite my best efforts concise is
rarely one of them.

> I'd like to check other DBMSs and your rich references for the FDW interface.  (My first intuition is that many major
> DBMSs might not have client C APIs that can be used to implement an async pipelining FDW interface.
 

Likely correct for C APIs of other traditional DBMSes. I'd be less
sure about newer non SQL ones, especially cloud oriented. For example
DynamoDB supports at least async requests in the Java client [3] and
C++ client [4]; it's not immediately clear if requests can be
pipelined, but the API suggests they can.

Most things with a REST-like API can do a fair bit of concurrency
though. Multiple async nonblocking HTTP connections can be serviced at
once. Or HTTP/1.1 pipelining can be used [1], or even better HTTP/2.0
streams [2]. This is relevant for any REST-like API.

> (It'd be kind of you to send emails in text format.  I've changed the format of this reply from HTML to text.)

I try to remember. Stupid Gmail.  Sorry. On mobile it offers very
little control over format but I'll do my best when I can.

[1] https://en.wikipedia.org/wiki/HTTP_pipelining
[2] https://blog.restcase.com/http2-benefits-for-rest-apis/
[3] https://aws.amazon.com/blogs/developer/asynchronous-requests-with-the-aws-sdk-for-java/
[4] https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_dynamo_d_b_1_1_dynamo_d_b_client.html#ab631edaccca5f3f8988af15e7e9aa4f0



Re: POC: postgres_fdw insert batching

From
David Fetter
Date:
On Wed, Nov 25, 2020 at 05:04:36AM +0000, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> > On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote:
> > > OTOH, as for the name GetModifyBatchSize() you suggest, I think
> > GetInsertBatchSize may be better.  That is, this API deals with multiple
> > records in a single INSERT statement.  Your GetModifyBatchSize will be
> > reserved for statement batching when libpq has supported batch/pipelining to
> > execute multiple INSERT/UPDATE/DELETE statements, as in the following
> > JDBC batch updates.  What do you think?
> > >
> > 
> > I don't know. I was really only thinking about batching in the context
> > of a single DML command, not about batching of multiple commands at the
> > protocol level. IMHO it's far more likely we'll add support for batching
> > for DELETE/UPDATE than libpq pipelining, which seems rather different
> > from how the FDW API works. Which is why I was suggesting to use a name
> > that would work for all DML commands, not just for inserts.
> 
> Right, I can't imagine now how the interaction among the client, server core and FDWs would be regarding the
> statement batching.  So I'll take your suggested name.
 
> 
> 
> > Not sure, but I'd guess knowing whether batching is used would be
> > useful. We only print the single-row SQL query, which kinda gives the
> > impression that there's no batching.
> 
> Added in postgres_fdw like "Remote SQL" when EXPLAIN VERBOSE is run.
> 
> 
> > > Don't worry about this, too.  GetMaxBulkInsertTuples() just returns a value
> > that was already saved in a struct in create_foreign_modify().
> > >
> > 
> > Well, I do worry for two reasons.
> > 
> > Firstly, the fact that in postgres_fdw the call is cheap does not mean
> > it'll be like that in every other FDW. Presumably, the other FDWs might
> > cache it in the struct and do the same thing, of course.
> > 
> > But the fact that we're calling it over and over for each row kinda
> > seems like we allow the value to change during execution, but I very
> > much doubt the code is expecting that. I haven't tried, but assume the
> > function first returns 10 and then 100. ISTM the code will allocate
> > ri_Slots with 25 slots, but then we'll try stashing 100 tuples there.
> > That can't end well. Sure, we can claim it's a bug in the FDW extension,
> > but it's also due to the API design.
> 
> You worried about other FDWs than postgres_fdw.  That's reasonable.  I insisted in other threads that PG developers
> care only about postgres_fdw, not other FDWs, when designing the FDW interface, but I myself made the same mistake.  I
> made changes so that the executor calls GetModifyBatchSize() once per relation per statement.
 

Please pardon me for barging in late in this discussion, but if we're
going to be using a bulk API here, wouldn't it make more sense to use
COPY, except where RETURNING is specified, in place of INSERT?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Craig Ringer <craig.ringer@enterprisedb.com>
> It was not my intention to hold this patch up or greatly expand its
> scope. I'll spend some time testing it out and have a closer read soon
> to see if I can help progress it.

Thank you, I'm relieved to hear that.  Last weekend, I was scared of a possible mood that's something like "We won't
accept the insert speedup patch for foreign tables unless you take full advantage of pipelining and achieve maximum
conceivable speed!"
 


> I thought I gave it at the time, and a demo program. IIRC it was just
> doing small multi row inserts or single row inserts. Latency would've
> been a couple of hundred ms probably, I think I did something like
> running on my laptop (Australia, ADSL) to a server on AWS US or EU.

A couple of hundred ms, so that would be dominant in each prepare-send-execute-receive, possibly even for a batch insert
with hundreds of rows in each batch.  Then, the synchronous batch insert of the current patch may achieve a few hundred
times speedup compared to single-row inserts when the batch size is hundreds or more.
 


> > I'd like to check other DBMSs and your rich references for the FDW interface.
> (My first intuition is that many major DBMSs might not have client C APIs that
> can be used to implement an async pipelining FDW interface.
> 
> Likely correct for C APIs of other traditional DBMSes. I'd be less
> sure about newer non SQL ones, especially cloud oriented. For example
> DynamoDB supports at least async requests in the Java client [3] and
> C++ client [4]; it's not immediately clear if requests can be
> pipelined, but the API suggests they can.

I've checked ODBC, MySQL, Microsoft Synapse Analytics, Redshift, and BigQuery, guessing that the data warehouses may
have asynchronous/pipelining APIs that enable efficient data integration/migration.  But none of them had one.  (I seem
to have spent too long and am a bit tired... but it was a bit fun as well.)  They all support INSERT with multiple
records in its VALUES clause.  So, it will be useful to provide a synchronous batch insert FDW API.  I guess Oracle's
OCI has an asynchronous API, but I didn't check it.
 

As an aside, MySQL 8.0.16 added support for asynchronous execution in its C API, but it allows only one active SQL
statement in each connection.  Likewise, although the ODBC standard defines asynchronous execution
(SQLSetStmtAttr(SQL_ASYNC_ENABLE) and SQLCompleteAsync), SQL Server and Synapse Analytics allow only one active
statement per connection.  psqlODBC doesn't support asynchronous execution.
 


> Most things with a REST-like API can do a fair bit of concurrency
> though. Multiple async nonblocking HTTP connections can be serviced at
> once. Or HTTP/1.1 pipelining can be used [1], or even better HTTP/2.0
> streams [2]. This is relevant for any REST-like API.

I'm not sure if this is related, but Google deprecated the Batch HTTP API [1].


[1] https://cloud.google.com/bigquery/batch


Regards
Takayuki Tsunakawa


RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: David Fetter <david@fetter.org>
> Please pardon me for barging in late in this discussion, but if we're
> going to be using a bulk API here, wouldn't it make more sense to use
> COPY, except where RETURNING is specified, in place of INSERT?

Please do not hesitate.  I mentioned earlier in this thread that I think INSERT is better because:


--------------------------------------------------
* When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY.

* COPY doesn't invoke rules, unlike INSERT.  (I don't think rules are a feature users care about, though.)  Also,
I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY.
--------------------------------------------------


Also, COPY to foreign tables currently uses INSERTs; the improvement of using COPY instead of INSERT is in progress
[1].  Keeping the "COPY uses COPY, INSERT uses INSERT" correspondence seems natural, and it makes COPY's high-speed
advantage stand out.


[1]
Fast COPY FROM command for the table with foreign partitions
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru


Regards
Takayuki Tsunakawa




Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
Hi,

Attached is a v6 of this patch, rebased to current master and with some 
minor improvements (mostly comments and renaming the "end" struct field 
to "values_end" which I think is more descriptive).

The one thing that keeps bugging me is convert_prep_stmt_params - it 
does the right thing, but the code is somewhat confusing.


AFAICS the discussions about making this use COPY and/or libpq 
pipelining (neither of which is committed yet) ended with the conclusion 
that those changes are somewhat independent, and that it's worth getting 
this committed in the current form. Barring objections, I'll push this 
within the next couple days.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Attached is a v6 of this patch, rebased to current master and with some minor
> improvements (mostly comments and renaming the "end" struct field to
> "values_end" which I think is more descriptive).

Thank you very much.  In fact, my initial patches used values_end, and I changed it to len considering that it may be
used for bulk UPDATE and DELETE in the future.  But I think values_end makes its role easier to understand, too.
 


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
Hi Tomas, Tsunakawa-san,

Thanks for your work on this.

On Tue, Jan 12, 2021 at 11:06 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> AFAICS the discussions about making this use COPY and/or libpq
> pipelining (neither of which is committed yet) ended with the conclusion
> that those changes are somewhat independent, and that it's worth getting
> this committed in the current form. Barring objections, I'll push this
> within the next couple days.

I was trying this out today (been meaning to do so for a while) and
noticed that this fails when there are AFTER ROW triggers on the
foreign table.  Here's an example:

create extension postgres_fdw ;
create server lb foreign data wrapper postgres_fdw ;
create user mapping for current_user server lb;
create table p (a numeric primary key);
create foreign table fp (a int) server lb options (table_name 'p');
create function print_row () returns trigger as $$ begin raise notice
'%', new; return null; end; $$ language plpgsql;
create trigger after_insert_trig after insert on fp for each row
execute function print_row();
insert into fp select generate_series (1, 10);
<crashes>

Apparently, the new code seems to assume that batching wouldn't be
active when the original query contains RETURNING clause but some
parts fail to account for the case where RETURNING is added to the
query to retrieve the tuple to pass to the AFTER TRIGGER.
Specifically, the Assert in the following block in
execute_foreign_modify() is problematic:

    /* Check number of rows affected, and fetch RETURNING tuple if any */
    if (fmstate->has_returning)
    {
        Assert(*numSlots == 1);
        n_rows = PQntuples(res);
        if (n_rows > 0)
            store_returning_result(fmstate, slots[0], res);
    }

-- 
Amit Langote
EDB: http://www.enterprisedb.com



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/13/21 10:15 AM, Amit Langote wrote:
> Hi Tomas, Tsunakawa-san,
> 
> Thanks for your work on this.
> 
> On Tue, Jan 12, 2021 at 11:06 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> AFAICS the discussions about making this use COPY and/or libpq
>> pipelining (neither of which is committed yet) ended with the conclusion
>> that those changes are somewhat independent, and that it's worth getting
>> this committed in the current form. Barring objections, I'll push this
>> within the next couple days.
> 
> I was trying this out today (been meaning to do so for a while) and
> noticed that this fails when there are AFTER ROW triggers on the
> foreign table.  Here's an example:
> 
> create extension postgres_fdw ;
> create server lb foreign data wrapper postgres_fdw ;
> create user mapping for current_user server lb;
> create table p (a numeric primary key);
> create foreign table fp (a int) server lb options (table_name 'p');
> create function print_row () returns trigger as $$ begin raise notice
> '%', new; return null; end; $$ language plpgsql;
> create trigger after_insert_trig after insert on fp for each row
> execute function print_row();
> insert into fp select generate_series (1, 10);
> <crashes>
> 
> Apparently, the new code seems to assume that batching wouldn't be
> active when the original query contains RETURNING clause but some
> parts fail to account for the case where RETURNING is added to the
> query to retrieve the tuple to pass to the AFTER TRIGGER.
> Specifically, the Assert in the following block in
> execute_foreign_modify() is problematic:
> 
>     /* Check number of rows affected, and fetch RETURNING tuple if any */
>     if (fmstate->has_returning)
>     {
>         Assert(*numSlots == 1);
>         n_rows = PQntuples(res);
>         if (n_rows > 0)
>             store_returning_result(fmstate, slots[0], res);
>     }
> 

Thanks for the report. Yeah, I think there's a missing check in
ExecInsert. Adding

  (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)

solves this. But now I'm wondering if this is the wrong place to make
this decision. I mean, why should we make the decision here, when the
decision whether to have a RETURNING clause is made in postgres_fdw in
deparseReturningList? We don't really know what the other FDWs will do,
for example.

So I think we should just move all of this into GetModifyBatchSize. We
can start with ri_BatchSize = 0. And then do

  if (resultRelInfo->ri_BatchSize == 0)
    resultRelInfo->ri_BatchSize =
      resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);

  if (resultRelInfo->ri_BatchSize > 1)
  {
    ... do batching ...
  }

The GetModifyBatchSize would always return value > 0, so either 1 (no
batching) or >1 (batching).
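For illustration, the postgres_fdw side could then look something like this.  A sketch only, not the patch: get_batch_size_option() stands in for however the server/table option is looked up, while ri_projectReturning and ri_TrigDesc are existing ResultRelInfo fields:

  /*
   * Illustrative sketch, not the patch itself: fold the "can we batch?"
   * checks into the callback, so the executor only tests the return value.
   */
  static int
  postgresGetModifyBatchSize(ResultRelInfo *resultRelInfo)
  {
      /*
       * RETURNING disables batching, including the RETURNING postgres_fdw
       * adds internally to feed AFTER ROW triggers on the foreign table.
       */
      if (resultRelInfo->ri_projectReturning != NULL ||
          (resultRelInfo->ri_TrigDesc &&
           resultRelInfo->ri_TrigDesc->trig_insert_after_row))
          return 1;

      /* otherwise use the configured batch_size option, always >= 1 */
      return Max(1, get_batch_size_option(resultRelInfo->ri_RelationDesc));
  }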


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/13/21 3:43 PM, Tomas Vondra wrote:
>
> ...
>
> Thanks for the report. Yeah, I think there's a missing check in
> ExecInsert. Adding
> 
>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
> 
> solves this. But now I'm wondering if this is the wrong place to make
> this decision. I mean, why should we make the decision here, when the
> decision whether to have a RETURNING clause is made in postgres_fdw in
> deparseReturningList? We don't really know what the other FDWs will do,
> for example.
> 
> So I think we should just move all of this into GetModifyBatchSize. We
> can start with ri_BatchSize = 0. And then do
> 
>   if (resultRelInfo->ri_BatchSize == 0)
>     resultRelInfo->ri_BatchSize =
>       resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
> 
>   if (resultRelInfo->ri_BatchSize > 1)
>   {
>     ... do batching ...
>   }
> 
> The GetModifyBatchSize would always return value > 0, so either 1 (no
> batching) or >1 (batching).
> 

FWIW the attached v8 patch does this - most of the conditions are moved
to the GetModifyBatchSize() callback. I've removed the check for the
BatchInsert callback, though - the FDW knows whether it supports that,
and it seems a bit pointless at the moment as there are no other batch
callbacks. Maybe we should add an Assert somewhere, though?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> FWIW the attached v8 patch does this - most of the conditions are moved to the
> GetModifyBatchSize() callback. I've removed the check for the BatchInsert
> callback, though - the FDW knows whether it supports that, and it seems a bit
> pointless at the moment as there are no other batch callbacks. Maybe we
> should add an Assert somewhere, though?

Thank you.  I'm in favor of this idea that the decision to support RETURNING and triggers is left to the FDW.  I don't
see the need for another Assert, as the caller has one for the returned batch size.
 


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
Hi,

On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/13/21 3:43 PM, Tomas Vondra wrote:
> > Thanks for the report. Yeah, I think there's a missing check in
> > ExecInsert. Adding
> >
> >   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
> >
> > solves this. But now I'm wondering if this is the wrong place to make
> > this decision. I mean, why should we make the decision here, when the
> > decision whether to have a RETURNING clause is made in postgres_fdw in
> > deparseReturningList? We don't really know what the other FDWs will do,
> > for example.
> >
> > So I think we should just move all of this into GetModifyBatchSize. We
> > can start with ri_BatchSize = 0. And then do
> >
> >   if (resultRelInfo->ri_BatchSize == 0)
> >     resultRelInfo->ri_BatchSize =
> >       resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
> >
> >   if (resultRelInfo->ri_BatchSize > 1)
> >   {
> >     ... do batching ...
> >   }
> >
> > The GetModifyBatchSize would always return value > 0, so either 1 (no
> > batching) or >1 (batching).
> >
>
> FWIW the attached v8 patch does this - most of the conditions are moved
> to the GetModifyBatchSize() callback.

Thanks.  A few comments:

* I agree with leaving it up to an FDW to look at the properties of
the table and of the operation being performed to decide whether or
not to use batching, although maybe BeginForeignModify() is a better
place for putting that logic instead of GetModifyBatchSize()?  So, in
create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
being set to match the table's or the server's value for the
batch_size option, make it also consider the things that prevent
batching and set the execution state's batch_size based on that.
GetModifyBatchSize() simply returns that value.

* Regarding the timing of calling GetModifyBatchSize() to set
ri_BatchSize, I wonder if it wouldn't be better to call it just once,
say from ExecInitModifyTable(), right after BeginForeignModify()
returns?  I don't quite understand why it is being called from
ExecInsert().  Can the batch size change once the execution starts?

* Lastly, how about calling it GetForeignModifyBatchSize() to be
consistent with other nearby callbacks?

> I've removed the check for the
> BatchInsert callback, though - the FDW knows whether it supports that,
> and it seems a bit pointless at the moment as there are no other batch
> callbacks. Maybe we should add an Assert somewhere, though?

Hmm, not checking whether BatchInsert() exists may not be good idea,
because if an FDW's GetModifyBatchSize() returns a value > 1 but
there's no BatchInsert() function to call, ExecBatchInsert() would
trip.  I don't see the newly added documentation telling FDW authors
to either define both or none.

Regarding how this plays with partitions, I don't think we need
ExecGetTouchedPartitions(), because you can get the routed-to
partitions using es_tuple_routing_result_relations.  Also, perhaps
it's a good idea to put the "finishing" ExecBatchInsert() calls into a
function ExecFinishBatchInsert().  Maybe the logic to choose the
relations to perform the finishing calls on will get complicated in
the future as batching is added for updates/deletes too and it seems
better to encapsulate that in the separate function than have it out
in the open in ExecModifyTable().

(Sorry about being so late reviewing this.)

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/14/21 9:58 AM, Amit Langote wrote:
> Hi,
> 
> On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 1/13/21 3:43 PM, Tomas Vondra wrote:
>>> Thanks for the report. Yeah, I think there's a missing check in
>>> ExecInsert. Adding
>>>
>>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
>>>
>>> solves this. But now I'm wondering if this is the wrong place to make
>>> this decision. I mean, why should we make the decision here, when the
>>> decision whether to have a RETURNING clause is made in postgres_fdw in
>>> deparseReturningList? We don't really know what the other FDWs will do,
>>> for example.
>>>
>>> So I think we should just move all of this into GetModifyBatchSize. We
>>> can start with ri_BatchSize = 0. And then do
>>>
>>>   if (resultRelInfo->ri_BatchSize == 0)
>>>     resultRelInfo->ri_BatchSize =
>>>       resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
>>>
>>>   if (resultRelInfo->ri_BatchSize > 1)
>>>   {
>>>     ... do batching ...
>>>   }
>>>
>>> The GetModifyBatchSize would always return value > 0, so either 1 (no
>>> batching) or >1 (batching).
>>>
>>
>> FWIW the attached v8 patch does this - most of the conditions are moved
>> to the GetModifyBatchSize() callback.
> 
> Thanks.  A few comments:
> 
> * I agree with leaving it up to an FDW to look at the properties of
> the table and of the operation being performed to decide whether or
> not to use batching, although maybe BeginForeignModify() is a better
> place for putting that logic instead of GetModifyBatchSize()?  So, in
> create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
> being set to match the table's or the server's value for the
> batch_size option, make it also consider the things that prevent
> batching and set the execution state's batch_size based on that.
> GetModifyBatchSize() simply returns that value.
> 
> * Regarding the timing of calling GetModifyBatchSize() to set
> ri_BatchSize, I wonder if it wouldn't be better to call it just once,
> say from ExecInitModifyTable(), right after BeginForeignModify()
> returns?  I don't quite understand why it is being called from
> ExecInsert().  Can the batch size change once the execution starts?
> 

But it should be called just once. The idea is that initially we have
batch_size=0, and the first call returns a value that is >= 1. So we never
call it again. But maybe it could be called from BeginForeignModify, in
which case we'd not need this logic with first setting it to 0 etc.

> * Lastly, how about calling it GetForeignModifyBatchSize() to be
> consistent with other nearby callbacks?
> 

Yeah, good point.

>> I've removed the check for the
>> BatchInsert callback, though - the FDW knows whether it supports that,
>> and it seems a bit pointless at the moment as there are no other batch
>> callbacks. Maybe we should add an Assert somewhere, though?
> 
> Hmm, not checking whether BatchInsert() exists may not be good idea,
> because if an FDW's GetModifyBatchSize() returns a value > 1 but
> there's no BatchInsert() function to call, ExecBatchInsert() would
> trip.  I don't see the newly added documentation telling FDW authors
> to either define both or none.
> 

Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
it can't hurt, I guess. I'll add it back.

> Regarding how this plays with partitions, I don't think we need
> ExecGetTouchedPartitions(), because you can get the routed-to
> partitions using es_tuple_routing_result_relations.  Also, perhaps

I'm not very familiar with es_tuple_routing_result_relations, but that
doesn't seem to work. I've replaced the flushing code at the end of
ExecModifyTable with a loop over es_tuple_routing_result_relations, but
then some of the rows are missing (i.e. not flushed).

> it's a good idea to put the "finishing" ExecBatchInsert() calls into a
> function ExecFinishBatchInsert().  Maybe the logic to choose the
> relations to perform the finishing calls on will get complicated in
> the future as batching is added for updates/deletes too and it seems
> better to encapsulate that in the separate function than have it out
> in the open in ExecModifyTable().
> 

IMO that'd be over-engineering at this point. We don't need such a
separate function yet, so why complicate the API? If we need it in the
future, we can add it.

> (Sorry about being so late reviewing this.)

thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Jan 14, 2021 at 21:57 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
On 1/14/21 9:58 AM, Amit Langote wrote:
> Hi,
>
> On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 1/13/21 3:43 PM, Tomas Vondra wrote:
>>> Thanks for the report. Yeah, I think there's a missing check in
>>> ExecInsert. Adding
>>>
>>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
>>>
>>> solves this. But now I'm wondering if this is the wrong place to make
>>> this decision. I mean, why should we make the decision here, when the
>>> decision whether to have a RETURNING clause is made in postgres_fdw in
>>> deparseReturningList? We don't really know what the other FDWs will do,
>>> for example.
>>>
>>> So I think we should just move all of this into GetModifyBatchSize. We
>>> can start with ri_BatchSize = 0. And then do
>>>
>>>   if (resultRelInfo->ri_BatchSize == 0)
>>>     resultRelInfo->ri_BatchSize =
>>>       resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
>>>
>>>   if (resultRelInfo->ri_BatchSize > 1)
>>>   {
>>>     ... do batching ...
>>>   }
>>>
>>> The GetModifyBatchSize would always return value > 0, so either 1 (no
>>> batching) or >1 (batching).
>>>
>>
>> FWIW the attached v8 patch does this - most of the conditions are moved
>> to the GetModifyBatchSize() callback.
>
> Thanks.  A few comments:
>
> * I agree with leaving it up to an FDW to look at the properties of
> the table and of the operation being performed to decide whether or
> not to use batching, although maybe BeginForeignModify() is a better
> place for putting that logic instead of GetModifyBatchSize()?  So, in
> create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
> being set to match the table's or the server's value for the
> batch_size option, make it also consider the things that prevent
> batching and set the execution state's batch_size based on that.
> GetModifyBatchSize() simply returns that value.
>
> * Regarding the timing of calling GetModifyBatchSize() to set
> ri_BatchSize, I wonder if it wouldn't be better to call it just once,
> say from ExecInitModifyTable(), right after BeginForeignModify()
> returns?  I don't quite understand why it is being called from
> ExecInsert().  Can the batch size change once the execution starts?
>

But it should be called just once. The idea is that initially we have
batch_size=0, and the fist call returns value that is >= 1. So we never
call it again. But maybe it could be called from BeginForeignModify, in
which case we'd not need this logic with first setting it to 0 etc.

Right, although I was thinking that maybe ri_BatchSize itself is not to be written to by the FDW.  Not to say that’s doing anything wrong though.

> * Lastly, how about calling it GetForeignModifyBatchSize() to be
> consistent with other nearby callbacks?
>

Yeah, good point.

>> I've removed the check for the
>> BatchInsert callback, though - the FDW knows whether it supports that,
>> and it seems a bit pointless at the moment as there are no other batch
>> callbacks. Maybe we should add an Assert somewhere, though?
>
> Hmm, not checking whether BatchInsert() exists may not be good idea,
> because if an FDW's GetModifyBatchSize() returns a value > 1 but
> there's no BatchInsert() function to call, ExecBatchInsert() would
> trip.  I don't see the newly added documentation telling FDW authors
> to either define both or none.
>

Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
it can't hurt, I guess. I'll ad it back.

> Regarding how this plays with partitions, I don't think we need
> ExecGetTouchedPartitions(), because you can get the routed-to
> partitions using es_tuple_routing_result_relations.  Also, perhaps

I'm not very familiar with es_tuple_routing_result_relations, but that
doesn't seem to work. I've replaced the flushing code at the end of
ExecModifyTable with a loop over es_tuple_routing_result_relations, but
then some of the rows are missing (i.e. not flushed).

I should’ve mentioned es_opened_result_relations too, which contains the non-routed result relations.  So I really meant: if (proute) then use es_tuple_routing_result_relations, else es_opened_result_relations.  This should work as long as batching is only used for inserts.
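For illustration, the flushing at the end of ExecModifyTable() would then look roughly like this (a sketch using field names along the lines of the patch, not the final code):

    /*
     * Sketch of flushing any remaining batched rows: the routed-to
     * partitions when tuple routing is in use, otherwise the plan's own
     * opened result relations.  "node" is the ModifyTableState, "proute"
     * the tuple routing state, and ExecBatchInsert() is the batch-insert
     * helper added by the patch.
     */
    List       *relinfos;
    ListCell   *lc;

    if (proute)
        relinfos = estate->es_tuple_routing_result_relations;
    else
        relinfos = estate->es_opened_result_relations;

    foreach(lc, relinfos)
    {
        ResultRelInfo *rri = (ResultRelInfo *) lfirst(lc);

        if (rri->ri_NumSlots > 0)
            ExecBatchInsert(node, rri,
                            rri->ri_Slots, rri->ri_PlanSlots,
                            rri->ri_NumSlots,
                            estate, node->canSetTag);
    }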


> it's a good idea to put the "finishing" ExecBatchInsert() calls into a
> function ExecFinishBatchInsert().  Maybe the logic to choose the
> relations to perform the finishing calls on will get complicated in
> the future as batching is added for updates/deletes too and it seems
> better to encapsulate that in the separate function than have it out
> in the open in ExecModifyTable().
>

IMO that'd be an over-engineering at this point. We don't need such
separate function yet, so why complicate the API? If we need it in the
future, we can add it.

Fair enough.
--

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/14/21 2:57 PM, Amit Langote wrote:
> On Thu, Jan 14, 2021 at 21:57 Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> 
>     On 1/14/21 9:58 AM, Amit Langote wrote:
>     > Hi,
>     >
>     > On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
>     > <tomas.vondra@enterprisedb.com> wrote:
>     >> On 1/13/21 3:43 PM, Tomas Vondra wrote:
>     >>> Thanks for the report. Yeah, I think there's a missing check in
>     >>> ExecInsert. Adding
>     >>>
>     >>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
>     >>>
>     >>> solves this. But now I'm wondering if this is the wrong place to
>     make
>     >>> this decision. I mean, why should we make the decision here,
>     when the
>     >>> decision whether to have a RETURNING clause is made in
>     postgres_fdw in
>     >>> deparseReturningList? We don't really know what the other FDWs
>     will do,
>     >>> for example.
>     >>>
>     >>> So I think we should just move all of this into
>     GetModifyBatchSize. We
>     >>> can start with ri_BatchSize = 0. And then do
>     >>>
>     >>>   if (resultRelInfo->ri_BatchSize == 0)
>     >>>     resultRelInfo->ri_BatchSize =
>     >>>     
>      resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
>     >>>
>     >>>   if (resultRelInfo->ri_BatchSize > 1)
>     >>>   {
>     >>>     ... do batching ...
>     >>>   }
>     >>>
>     >>> The GetModifyBatchSize would always return value > 0, so either
>     1 (no
>     >>> batching) or >1 (batching).
>     >>>
>     >>
>     >> FWIW the attached v8 patch does this - most of the conditions are
>     moved
>     >> to the GetModifyBatchSize() callback.
>     >
>     > Thanks.  A few comments:
>     >
>     > * I agree with leaving it up to an FDW to look at the properties of
>     > the table and of the operation being performed to decide whether or
>     > not to use batching, although maybe BeginForeignModify() is a better
>     > place for putting that logic instead of GetModifyBatchSize()?  So, in
>     > create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
>     > being set to match the table's or the server's value for the
>     > batch_size option, make it also consider the things that prevent
>     > batching and set the execution state's batch_size based on that.
>     > GetModifyBatchSize() simply returns that value.
>     >
>     > * Regarding the timing of calling GetModifyBatchSize() to set
>     > ri_BatchSize, I wonder if it wouldn't be better to call it just once,
>     > say from ExecInitModifyTable(), right after BeginForeignModify()
>     > returns?  I don't quite understand why it is being called from
>     > ExecInsert().  Can the batch size change once the execution starts?
>     >
> 
>     But it should be called just once. The idea is that initially we have
>     batch_size=0, and the fist call returns value that is >= 1. So we never
>     call it again. But maybe it could be called from BeginForeignModify, in
>     which case we'd not need this logic with first setting it to 0 etc.
> 
> 
> Right, although I was thinking that maybe ri_BatchSize itself is not to
> be written to by the FDW.  Not to say that’s doing anything wrong though.
> 
>     > * Lastly, how about calling it GetForeignModifyBatchSize() to be
>     > consistent with other nearby callbacks?
>     >
> 
>     Yeah, good point.
> 
>     >> I've removed the check for the
>     >> BatchInsert callback, though - the FDW knows whether it supports
>     that,
>     >> and it seems a bit pointless at the moment as there are no other
>     batch
>     >> callbacks. Maybe we should add an Assert somewhere, though?
>     >
>     > Hmm, not checking whether BatchInsert() exists may not be good idea,
>     > because if an FDW's GetModifyBatchSize() returns a value > 1 but
>     > there's no BatchInsert() function to call, ExecBatchInsert() would
>     > trip.  I don't see the newly added documentation telling FDW authors
>     > to either define both or none.
>     >
> 
>     Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
>     it can't hurt, I guess. I'll ad it back.
> 
>     > Regarding how this plays with partitions, I don't think we need
>     > ExecGetTouchedPartitions(), because you can get the routed-to
>     > partitions using es_tuple_routing_result_relations.  Also, perhaps
> 
>     I'm not very familiar with es_tuple_routing_result_relations, but that
>     doesn't seem to work. I've replaced the flushing code at the end of
>     ExecModifyTable with a loop over es_tuple_routing_result_relations, but
>     then some of the rows are missing (i.e. not flushed).
> 
> 
> I should’ve mentioned es_opened_result_relations too which contain
> non-routing result relations.  So I really meant if (proute) then use
> es_tuple_routing_result_relations, else es_opened_result_relations. 
> This should work as long as batching is only used for inserts.
> 

Ah, right. That did the trick.

Attached is v9 with all of those tweaks, except for moving the BatchSize
call to BeginForeignModify - I tried that, but it did not seem like an
improvement, because we'd still need the checks for API callbacks in
ExecInsert for example. So I decided not to do that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Attached is v9 with all of those tweaks, except for moving the BatchSize call to
> BeginForeignModify - I tried that, but it did not seem like an improvement,
> because we'd still need the checks for API callbacks in ExecInsert for example.
> So I decided not to do that.

Thanks, Tomas-san.  The patch looks good again.

Amit-san, thank you for teaching us about es_tuple_routing_result_relations and es_opened_result_relations.


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Fri, Jan 15, 2021 at 12:05 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Attached is v9 with all of those tweaks,

Thanks.

> except for moving the BatchSize
> call to BeginForeignModify - I tried that, but it did not seem like an
> improvement, because we'd still need the checks for API callbacks in
> ExecInsert for example. So I decided not to do that.

Okay, so maybe not moving the whole logic into the FDW's
BeginForeignModify(), but at least if we move this...

@@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
+       /*
+        * Determine if the FDW supports batch insert and determine the batch
+        * size (a FDW may support batching, but it may be disabled for the
+        * server/table). Do this only once, at the beginning - we don't want
+        * the batch size to change during execution.
+        */
+       if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
+           resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
+           resultRelInfo->ri_BatchSize == 0)
+           resultRelInfo->ri_BatchSize =
+
resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);

...into ExecInitModifyTable(), ExecInsert() only needs the following block:

       /*
+        * If the FDW supports batching, and batching is requested, accumulate
+        * rows and insert them in batches. Otherwise use the per-row inserts.
+        */
+       if (resultRelInfo->ri_BatchSize > 1)
+       {
+ ...

AFAICS, I don't see anything that will cause ri_BatchSize to become 0
once set, so I don't see the point of checking whether it needs to be set
again on every ExecInsert() call.  Also, maybe not that important, but
shaving off 3 comparisons for every tuple would add up nicely IMHO,
especially given that we're targeting bulk loads.

--
Amit Langote
EDB: http://www.enterprisedb.com



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Amit Langote <amitlangote09@gmail.com>
> Okay, so maybe not moving the whole logic into the FDW's
> BeginForeignModify(), but at least if we move this...
> 
> @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
> +       /*
> +        * Determine if the FDW supports batch insert and determine the
> batch
> +        * size (a FDW may support batching, but it may be disabled for the
> +        * server/table). Do this only once, at the beginning - we don't want
> +        * the batch size to change during execution.
> +        */
> +       if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
> +           resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
> +           resultRelInfo->ri_BatchSize == 0)
> +           resultRelInfo->ri_BatchSize =
> +
> resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
> 
> ...into ExecInitModifyTable(), ExecInsert() only needs the following block:

Does ExecInitModifyTable() know all leaf partitions where the tuples produced by VALUES or SELECT go?  Doesn't
ExecInsert() find the target leaf partition for the first time, through the call to ExecPrepareTupleRouting()?  Leaf
partitions can have different batch_size settings.
 


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Sat, Jan 16, 2021 at 12:00 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Amit Langote <amitlangote09@gmail.com>
> > Okay, so maybe not moving the whole logic into the FDW's
> > BeginForeignModify(), but at least if we move this...
> >
> > @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
> > +       /*
> > +        * Determine if the FDW supports batch insert and determine the
> > batch
> > +        * size (a FDW may support batching, but it may be disabled for the
> > +        * server/table). Do this only once, at the beginning - we don't want
> > +        * the batch size to change during execution.
> > +        */
> > +       if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
> > +           resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
> > +           resultRelInfo->ri_BatchSize == 0)
> > +           resultRelInfo->ri_BatchSize =
> > +
> > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
> >
> > ...into ExecInitModifyTable(), ExecInsert() only needs the following block:
>
> Does ExecInitModifyTable() know all leaf partitions where the tuples produced by VALUES or SELECT go?  Doesn't
> ExecInsert() find the target leaf partition for the first time, through the call to ExecPrepareTupleRouting()?  Leaf
> partitions can have different batch_size settings.

Good thing you reminded me that this is about inserts, and in that
case no, ExecInitModifyTable() doesn't know all leaf partitions, it
only sees the root table whose batch_size doesn't really matter.  So
it's really ExecInitRoutingInfo() that I would recommend to set
ri_BatchSize; right after this block:

/*
 * If the partition is a foreign table, let the FDW init itself for
 * routing tuples to the partition.
 */
if (partRelInfo->ri_FdwRoutine != NULL &&
    partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
    partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);

Note that ExecInitRoutingInfo() is called only once for a partition
when it is initialized after being inserted into for the first time.

For non-partitioned targets, I'd still say set ri_BatchSize in
ExecInitModifyTable().

--
Amit Langote
EDB: http://www.enterprisedb.com



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
Tomas-san,

From: Amit Langote <amitlangote09@gmail.com>
> Good thing you reminded me that this is about inserts, and in that
> case no, ExecInitModifyTable() doesn't know all leaf partitions, it
> only sees the root table whose batch_size doesn't really matter.  So
> it's really ExecInitRoutingInfo() that I would recommend to set
> ri_BatchSize; right after this block:
> 
> /*
>  * If the partition is a foreign table, let the FDW init itself for
>  * routing tuples to the partition.
>  */
> if (partRelInfo->ri_FdwRoutine != NULL &&
>     partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
>     partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
> 
> Note that ExecInitRoutingInfo() is called only once for a partition
> when it is initialized after being inserted into for the first time.
> 
> For a non-partitioned targets, I'd still say set ri_BatchSize in
> ExecInitModifyTable().

Attached is the patch that adds a call to GetModifyBatchSize() to the above two places.  The regression test passes.

(FWIW, frankly, I prefer the previous version because the code is a bit smaller...  Maybe we should refactor the code
someday to reduce the similar processing in the partitioned and non-partitioned cases.)
 


Regards
Takayuki Tsunakawa


Attachment

Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Hi, Takayuki-san:

+           if (batch_size <= 0)
+               ereport(ERROR,
+                       (errcode(ERRCODE_SYNTAX_ERROR),
+                        errmsg("%s requires a non-negative integer value",

It seems the message doesn't match the check w.r.t. the batch size of 0.
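
For the check and the message to agree, the message would need to say "positive"; a rough sketch
of that (the def->defname argument is an assumption here, mirroring how other postgres_fdw option
errors are reported, and is not taken from the patch text):

    if (batch_size <= 0)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires a positive integer value",
                        def->defname)));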

+   int         numInserted = numSlots;

Since numInserted is filled by ExecForeignBatchInsert(), the initialization can be done with 0.

Cheers

On Sun, Jan 17, 2021 at 10:52 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
Tomas-san,

From: Amit Langote <amitlangote09@gmail.com>
> Good thing you reminded me that this is about inserts, and in that
> case no, ExecInitModifyTable() doesn't know all leaf partitions, it
> only sees the root table whose batch_size doesn't really matter.  So
> it's really ExecInitRoutingInfo() that I would recommend to set
> ri_BatchSize; right after this block:
>
> /*
>  * If the partition is a foreign table, let the FDW init itself for
>  * routing tuples to the partition.
>  */
> if (partRelInfo->ri_FdwRoutine != NULL &&
>     partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
>     partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
>
> Note that ExecInitRoutingInfo() is called only once for a partition
> when it is initialized after being inserted into for the first time.
>
> For a non-partitioned targets, I'd still say set ri_BatchSize in
> ExecInitModifyTable().

Attached is the patch that added call to GetModifyBatchSize() to the above two places.  The regression test passes.

(FWIW, frankly, I prefer the previous version because the code is a bit smaller...  Maybe we should refactor the code someday to reduce similar processings in both the partitioned case and non-partitioned case.)


Regards
Takayuki Tsunakawa

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/18/21 7:51 AM, tsunakawa.takay@fujitsu.com wrote:
> Tomas-san,
> 
> From: Amit Langote <amitlangote09@gmail.com>
>> Good thing you reminded me that this is about inserts, and in that 
>> case no, ExecInitModifyTable() doesn't know all leaf partitions,
>> it only sees the root table whose batch_size doesn't really matter.
>> So it's really ExecInitRoutingInfo() that I would recommend to set 
>> ri_BatchSize; right after this block:
>> 
>> /* * If the partition is a foreign table, let the FDW init itself
>> for * routing tuples to the partition. */ if
>> (partRelInfo->ri_FdwRoutine != NULL && 
>> partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) 
>> partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate,
>> partRelInfo);
>> 
>> Note that ExecInitRoutingInfo() is called only once for a
>> partition when it is initialized after being inserted into for the
>> first time.
>> 
>> For a non-partitioned targets, I'd still say set ri_BatchSize in 
>> ExecInitModifyTable().
> 
> Attached is the patch that added call to GetModifyBatchSize() to the
> above two places.  The regression test passes.
> 
> (FWIW, frankly, I prefer the previous version because the code is a
> bit smaller...  Maybe we should refactor the code someday to reduce
> similar processings in both the partitioned case and non-partitioned
> case.)
> 

Less code would be nice, but it's not always the right thing to do, 
unfortunately :-(

I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so 
attached is a rebased patch (0001) fixing that.

0002 adds a couple comments and minor tweaks

0003 addresses a couple shortcomings related to explain - we haven't 
been showing the batch size for EXPLAIN (VERBOSE), because there'd be no 
FdwState, so this tries to fix that. Furthermore, there were no tests 
for EXPLAIN output with batch size, so I added a couple.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so attached is
> a rebased patch (0001) fixing that.
> 
> 0002 adds a couple comments and minor tweaks
> 
> 0003 addresses a couple shortcomings related to explain - we haven't been
> showing the batch size for EXPLAIN (VERBOSE), because there'd be no
> FdwState, so this tries to fix that. Furthermore, there were no tests for EXPLAIN
> output with batch size, so I added a couple.

Thank you, good additions.  They all look good.
Only one point: I think the code for retrieving batch_size in create_foreign_modify() can be replaced with a call to
the new function in 0003.
 

God bless us.


Regards
Takayuki Tsunakawa


RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
Tomas-san, Zhihong-san,

From: Zhihong Yu <zyu@yugabyte.com> 
> +           if (batch_size <= 0)
> +               ereport(ERROR,
> +                       (errcode(ERRCODE_SYNTAX_ERROR),
> +                        errmsg("%s requires a non-negative integer value",
> 
> It seems the message doesn't match the check w.r.t. the batch size of 0.

Ah, "non-negative" should be "positive".  The message for the existing fetch_size should be fixed too.  Tomas-san,
couldyou include this as well?  I'm sorry to trouble you.
 


> +   int         numInserted = numSlots;
> 
> Since numInserted is filled by ExecForeignBatchInsert(), the initialization can be done with 0.

No, the code is correct, since the batch function requires the number of rows to insert as input.
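
In other words, the count is an in/out parameter: the caller passes in how many buffered rows it is
handing over, and the FDW overwrites it with how many rows it actually inserted.  A rough sketch of
the call site, built only from the callback signature and ResultRelInfo fields that appear elsewhere
in this thread (slots, planSlots and rslots are placeholder names, not the exact patch text):

    /* numInserted starts as the number of buffered rows handed to the FDW */
    int         numInserted = numSlots;

    rslots = resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert(estate,
                                                                  resultRelInfo,
                                                                  slots,
                                                                  planSlots,
                                                                  &numInserted);

    /* on return, numInserted says how many rows the FDW actually inserted */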


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/19/21 2:28 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> I took a look at this - there's a bit of bitrot due to
>> 708d165ddb92c, so attached is a rebased patch (0001) fixing that.
>> 
>> 0002 adds a couple comments and minor tweaks
>> 
>> 0003 addresses a couple shortcomings related to explain - we
>> haven't been showing the batch size for EXPLAIN (VERBOSE), because
>> there'd be no FdwState, so this tries to fix that. Furthermore,
>> there were no tests for EXPLAIN output with batch size, so I added
>> a couple.
> 
> Thank you, good additions.  They all look good. Only one point: I
> think the code for retrieving batch_size in create_foreign_modify()
> can be replaced with a call to the new function in 0003.
> 

OK. Can you prepare a final patch, squashing all the commits into a 
single one, and perhaps use the function in create_foreign_modify?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> OK. Can you prepare a final patch, squashing all the commits into a
> single one, and perhaps use the function in create_foreign_modify?

Attached, including the message fix pointed out by Zaihong-san.


Regards
Takayuki Tsunakawa


Attachment

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
Tsunakawa-san,

On Tue, Jan 19, 2021 at 12:50 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> > OK. Can you prepare a final patch, squashing all the commits into a
> > single one, and perhaps use the function in create_foreign_modify?
>
> Attached, including the message fix pointed by Zaihong-san.

Thanks for adopting my suggestions regarding GetForeignModifyBatchSize().

I apologize in advance for being maybe overly pedantic, but I noticed
that, in ExecInitModifyTable(), you decided to place the call outside
the loop that goes over resultRelations (shown below), although my
intent was to ask to place it next to the BeginForeignModify() in that
loop.

    resultRelInfo = mtstate->resultRelInfo;
    i = 0;
    forboth(l, node->resultRelations, l1, node->plans)
    {
        ...
        /* Also let FDWs init themselves for foreign-table result rels */
        if (!resultRelInfo->ri_usesFdwDirectModify &&
            resultRelInfo->ri_FdwRoutine != NULL &&
            resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL)
        {
            List       *fdw_private = (List *) list_nth(node->fdwPrivLists, i);

            resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate,
                                                             resultRelInfo,
                                                             fdw_private,
                                                             i,
                                                             eflags);
        }

Maybe it's fine today because we only care about inserts and there's
always only one entry in the resultRelations list in that case, but
that may not remain the case in the future.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Amit Langote <amitlangote09@gmail.com>
> I apologize in advance for being maybe overly pedantic, but I noticed
> that, in ExecInitModifyTable(), you decided to place the call outside
> the loop that goes over resultRelations (shown below), although my
> intent was to ask to place it next to the BeginForeignModify() in that
> loop.

Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed, because
postgres_fdw's GetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created
after the above part.  Considering the context where GetModifyBatchSize() implementations may want to know the
environment, I placed the call as late as possible in the initialization phase.  As for the future(?) multi-target DML
statements, I think we can change this together with many other(?) parts that assume a single target table.
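
To make that dependency concrete, here is a simplified sketch of what the postgres_fdw callback needs
to be able to see (an illustration only, not the actual postgres_fdw implementation; the option lookup
is elided):

    static int
    postgresGetForeignModifyBatchSize(ResultRelInfo *resultRelInfo)
    {
        int     batch_size = 1;     /* would come from the server/table options */

        /*
         * Batching is not used when the statement has a RETURNING clause, so
         * the callback must be able to inspect ri_projectReturning -- which
         * is only set up later in ExecInitModifyTable().
         */
        if (resultRelInfo->ri_projectReturning != NULL)
            return 1;

        return batch_size;
    }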
 


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Amit Langote <amitlangote09@gmail.com>
> > I apologize in advance for being maybe overly pedantic, but I noticed
> > that, in ExecInitModifyTable(), you decided to place the call outside
> > the loop that goes over resultRelations (shown below), although my
> > intent was to ask to place it next to the BeginForeignModify() in that
> > loop.
>
> Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed, because
> postgres_fdw's GetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created
> after the above part.  Considering the context where GetModifyBatchSize() implementations may want to know the
> environment, I placed the call as late as possible in the initialization phase.  As for the future(?) multi-target DML
> statements, I think we can change this together with many other(?) parts that assume a single target table.

Okay, sometime later then.

I wasn't sure if bringing it up here would be appropriate, but there's
a patch by me to refactor ModfiyTable result relation allocation that
will have to remember to move this code along to an appropriate place
[1].  Thanks for the tip about the dependency on how RETURNING is
handled.  I will remember it when rebasing my patch over this.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://commitfest.postgresql.org/31/2621/



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/19/21 7:23 AM, Amit Langote wrote:
> On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
>> From: Amit Langote <amitlangote09@gmail.com>
>>> I apologize in advance for being maybe overly pedantic, but I noticed
>>> that, in ExecInitModifyTable(), you decided to place the call outside
>>> the loop that goes over resultRelations (shown below), although my
>>> intent was to ask to place it next to the BeginForeignModify() in that
>>> loop.
>>
>> Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed, because
>> postgres_fdw's GetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is
>> created after the above part.  Considering the context where GetModifyBatchSize() implementations may want to know the
>> environment, I placed the call as late as possible in the initialization phase.  As for the future(?) multi-target DML
>> statements, I think we can change this together with many other(?) parts that assume a single target table.
 
> 
> Okay, sometime later then.
> 
> I wasn't sure if bringing it up here would be appropriate, but there's
> a patch by me to refactor ModfiyTable result relation allocation that
> will have to remember to move this code along to an appropriate place
> [1].  Thanks for the tip about the dependency on how RETURNING is
> handled.  I will remember it when rebasing my patch over this.
> 

Thanks. The last version (v12) should be addressing all the comments and 
seems fine to me, so barring objections I'll get that pushed shortly.

One thing that seems a bit annoying is that with the partitioned table 
the explain (verbose) looks like this:

                      QUERY PLAN
-----------------------------------------------------
  Insert on public.batch_table
    ->  Function Scan on pg_catalog.generate_series i
          Output: i.i
          Function Call: generate_series(1, 66)
(4 rows)

That is, there's no information about the batch size :-( But AFAICS 
that's due to how explain shows (or rather does not) partitions in this 
type of plan.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Wed, Jan 20, 2021 at 1:01 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/19/21 7:23 AM, Amit Langote wrote:
> > On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com
> >> Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed, because
> >> postgres_fdw's GetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is
> >> created after the above part.  Considering the context where GetModifyBatchSize() implementations may want to know the
> >> environment, I placed the call as late as possible in the initialization phase.  As for the future(?) multi-target DML
> >> statements, I think we can change this together with many other(?) parts that assume a single target table.
> >
> > Okay, sometime later then.
> >
> > I wasn't sure if bringing it up here would be appropriate, but there's
> > a patch by me to refactor ModfiyTable result relation allocation that
> > will have to remember to move this code along to an appropriate place
> > [1].  Thanks for the tip about the dependency on how RETURNING is
> > handled.  I will remember it when rebasing my patch over this.
> >
>
> Thanks. The last version (v12) should be addressing all the comments and
> seems fine to me, so barring objections I'll get that pushed shortly.

+1

> One thing that seems a bit annoying is that with the partitioned table
> the explain (verbose) looks like this:
>
>                       QUERY PLAN
> -----------------------------------------------------
>   Insert on public.batch_table
>     ->  Function Scan on pg_catalog.generate_series i
>           Output: i.i
>           Function Call: generate_series(1, 66)
> (4 rows)
>
> That is, there's no information about the batch size :-( But AFAICS
> that's due to how explain shows (or rather does not) partitions in this
> type of plan.

Yeah.  Partition result relations are always lazily allocated for
INSERT, so EXPLAIN (without ANALYZE) has no idea what to show for
them, nor does it know which partitions will be used in the first
place.  With ANALYZE however, you could get them from
es_tuple_routing_result_relations and maybe list them if you want, but
that sounds like a project on its own.

--
Amit Langote
EDB: http://www.enterprisedb.com



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
OK, pushed after a little bit of additional polishing (mostly comments).

Thanks everyone!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
Hmm, seems that florican doesn't like this :-(

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15

It's an i386 machine running FreeBSD, so I'm not sure what exactly it's picky 
about. But when I tried running this under valgrind, I get some strange 
failures in the new chunk in ExecInitModifyTable:

   /*
    * Determine if the FDW supports batch insert and determine the batch
    * size (a FDW may support batching, but it may be disabled for the
    * server/table).
    */
   if (!resultRelInfo->ri_usesFdwDirectModify &&
       operation == CMD_INSERT &&
       resultRelInfo->ri_FdwRoutine != NULL &&
       resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
       resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
       resultRelInfo->ri_BatchSize =
 
resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
   else
       resultRelInfo->ri_BatchSize = 1;

   Assert(resultRelInfo->ri_BatchSize >= 1);

It seems as if the resultRelInfo is not initialized, or something like 
that. I wouldn't be surprised if the 32-bit machine was pickier and 
failing because of that.

A sample of the valgrind log is attached. It's pretty much just 
repetitions of these three reports.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: POC: postgres_fdw insert batching

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> OK, pushed after a little bit of additional polishing (mostly comments).
> Thanks everyone!

florican reports this is seriously broken on 32-bit hardware:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15

First guess is incorrect memory-allocation computations ...

            regards, tom lane



Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Hi,
The assignment to resultRelInfo is done when junk_filter_needed is true:

        if (junk_filter_needed)
        {
            resultRelInfo = mtstate->resultRelInfo;

Should the code for determining batch size access mtstate->resultRelInfo directly ?

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 9c36860704..a6a814454d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2798,17 +2798,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
      * size (a FDW may support batching, but it may be disabled for the
      * server/table).
      */
-    if (!resultRelInfo->ri_usesFdwDirectModify &&
+    if (!mtstate->resultRelInfo->ri_usesFdwDirectModify &&
         operation == CMD_INSERT &&
-        resultRelInfo->ri_FdwRoutine != NULL &&
-        resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
-        resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
-        resultRelInfo->ri_BatchSize =
-            resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
+        mtstate->resultRelInfo->ri_FdwRoutine != NULL &&
+        mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
+        mtstate->resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
+        mtstate->resultRelInfo->ri_BatchSize =
+            mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(mtstate->resultRelInfo);
     else
-        resultRelInfo->ri_BatchSize = 1;
+        mtstate->resultRelInfo->ri_BatchSize = 1;

-    Assert(resultRelInfo->ri_BatchSize >= 1);
+    Assert(mtstate->resultRelInfo->ri_BatchSize >= 1);

     /*
      * Lastly, if this is not the primary (canSetTag) ModifyTable node, add it

Cheers

On Wed, Jan 20, 2021 at 3:52 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
Hmm, seems that florican doesn't like this :-(

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15

It's a i386 machine running FreeBSD, so not sure what exactly it's picky
about. But when I tried running this under valgrind, I get some strange
failures in the new chunk in ExecInitModifyTable:

   /*
    * Determine if the FDW supports batch insert and determine the batch
    * size (a FDW may support batching, but it may be disabled for the
    * server/table).
    */
   if (!resultRelInfo->ri_usesFdwDirectModify &&
       operation == CMD_INSERT &&
       resultRelInfo->ri_FdwRoutine != NULL &&
       resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
       resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
       resultRelInfo->ri_BatchSize =

resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
   else
       resultRelInfo->ri_BatchSize = 1;

   Assert(resultRelInfo->ri_BatchSize >= 1);

It seems as if the resultRelInfo is not initialized, or something like
that. I wouldn't be surprised if the 32-bit machine was pickier and
failing because of that.

A sample of the valgrind log is attached. It's pretty much just
repetitions of these three reports.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/21/21 12:52 AM, Tomas Vondra wrote:
> Hmm, seems that florican doesn't like this :-(
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 
> 
> 
> It's a i386 machine running FreeBSD, so not sure what exactly it's picky 
> about. But when I tried running this under valgrind, I get some strange 
> failures in the new chunk in ExecInitModifyTable:
> 
>    /*
>     * Determine if the FDW supports batch insert and determine the batch
>     * size (a FDW may support batching, but it may be disabled for the
>     * server/table).
>     */
>    if (!resultRelInfo->ri_usesFdwDirectModify &&
>        operation == CMD_INSERT &&
>        resultRelInfo->ri_FdwRoutine != NULL &&
>        resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
>        resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
>        resultRelInfo->ri_BatchSize =
> 
> resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
>    else
>        resultRelInfo->ri_BatchSize = 1;
> 
>    Assert(resultRelInfo->ri_BatchSize >= 1);
> 
> It seems as if the resultRelInfo is not initialized, or something like 
> that. I wouldn't be surprised if the 32-bit machine was pickier and 
> failing because of that.
> 
> A sample of the valgrind log is attached. It's pretty much just 
> repetitions of these three reports.
> 

OK, it's definitely accessing uninitialized memory, because the 
resultRelInfo (on line 2801, i.e. the "if" condition) looks like this:

(gdb) p resultRelInfo
$1 = (ResultRelInfo *) 0xe595988
(gdb) p *resultRelInfo
$2 = {type = 2139062142, ri_RangeTableIndex = 2139062143, 
ri_RelationDesc = 0x7f7f7f7f7f7f7f7f, ri_NumIndices = 2139062143, 
ri_IndexRelationDescs = 0x7f7f7f7f7f7f7f7f, ri_IndexRelationInfo = 
0x7f7f7f7f7f7f7f7f,
   ri_TrigDesc = 0x7f7f7f7f7f7f7f7f, ri_TrigFunctions = 
0x7f7f7f7f7f7f7f7f, ri_TrigWhenExprs = 0x7f7f7f7f7f7f7f7f, 
ri_TrigInstrument = 0x7f7f7f7f7f7f7f7f, ri_ReturningSlot = 
0x7f7f7f7f7f7f7f7f, ri_TrigOldSlot = 0x7f7f7f7f7f7f7f7f,
   ri_TrigNewSlot = 0x7f7f7f7f7f7f7f7f, ri_FdwRoutine = 
0x7f7f7f7f7f7f7f7f, ri_FdwState = 0x7f7f7f7f7f7f7f7f, 
ri_usesFdwDirectModify = 127, ri_NumSlots = 2139062143, ri_BatchSize = 
2139062143, ri_Slots = 0x7f7f7f7f7f7f7f7f,
   ri_PlanSlots = 0x7f7f7f7f7f7f7f7f, ri_WithCheckOptions = 
0x7f7f7f7f7f7f7f7f, ri_WithCheckOptionExprs = 0x7f7f7f7f7f7f7f7f, 
ri_ConstraintExprs = 0x7f7f7f7f7f7f7f7f, ri_GeneratedExprs = 
0x7f7f7f7f7f7f7f7f,
   ri_NumGeneratedNeeded = 2139062143, ri_junkFilter = 
0x7f7f7f7f7f7f7f7f, ri_returningList = 0x7f7f7f7f7f7f7f7f, 
ri_projectReturning = 0x7f7f7f7f7f7f7f7f, ri_onConflictArbiterIndexes = 
0x7f7f7f7f7f7f7f7f,
   ri_onConflict = 0x7f7f7f7f7f7f7f7f, ri_PartitionCheckExpr = 
0x7f7f7f7f7f7f7f7f, ri_PartitionRoot = 0x7f7f7f7f7f7f7f7f, 
ri_RootToPartitionMap = 0x8, ri_PartitionTupleSlot = 0x8, 
ri_ChildToRootMap = 0xe5952b0,
   ri_CopyMultiInsertBuffer = 0xe596740}
(gdb)

I may be wrong, but the most likely explanation seems to be this is due 
to the junk filter initialization, which simply moves past the end of 
the mtstate->resultRelInfo array.

It kinda seems the GetForeignModifyBatchSize call should happen before 
that block. The attached patch fixes this for me (i.e. regression tests 
pass with no valgrind reports).

Or did I get that wrong?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/21/21 12:59 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> OK, pushed after a little bit of additional polishing (mostly comments).
>> Thanks everyone!
> 
> florican reports this is seriously broken on 32-bit hardware:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15
> 
> First guess is incorrect memory-allocation computations ...
> 

I know, although it seems more like an access to uninitialized memory. 
I've already posted a patch that resolves that for me on 64-bits (per 
valgrind, I suppose it's the same issue).

I'm working on reproducing it on 32-bits, hopefully it won't take long.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/21/21 1:17 AM, Zhihong Yu wrote:
> Hi,
> The assignment to resultRelInfo is done when junk_filter_needed is true:
> 
>          if (junk_filter_needed)
>          {
>              resultRelInfo = mtstate->resultRelInfo;
> 
> Should the code for determining batch size access mtstate->resultRelInfo 
> directly ?
> 

IMO the issue is that code iterates over all plans and moves to the next 
for each one:

     resultRelInfo++;

so it ends up pointing past the last element, hence the failures. So 
yeah, either the code needs to move before the loop (per my patch), or 
we need to access mtstate->resultRelInfo directly.
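
The underlying failure mode is an ordinary past-the-end pointer; here is a tiny self-contained
illustration, nothing PostgreSQL-specific (toy types, just to show the pattern):

    #include <stdio.h>

    struct rel { int batch_size; };

    int
    main(void)
    {
        struct rel  rels[2] = {{0}, {0}};
        struct rel *r = rels;
        int         i;

        /* the loop advances the pointer once per element ... */
        for (i = 0; i < 2; i++)
        {
            r->batch_size = 1;
            r++;
        }

        /*
         * ... so here r points one past the last element, and dereferencing
         * it (as the batching block effectively did) reads garbage memory.
         * The fix is to reset the pointer (or index the array) before use.
         */
        r = rels;
        printf("%d\n", r->batch_size);
        return 0;
    }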

I'm pretty amazed this did not crash during any of the many regression 
runs I did recently.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Hi, Tomas:
In my opinion, my patch is a little better.
Suppose one of the conditions in the if block changes between the start of the loop and the end of the loop:

     * Determine if the FDW supports batch insert and determine the batch
     * size (a FDW may support batching, but it may be disabled for the
     * server/table).

My patch would reflect that change. I guess this was the reason the if / else block was placed there in the first place.

Cheers

On Wed, Jan 20, 2021 at 4:56 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 1/21/21 1:17 AM, Zhihong Yu wrote:
> Hi,
> The assignment to resultRelInfo is done when junk_filter_needed is true:
>
>          if (junk_filter_needed)
>          {
>              resultRelInfo = mtstate->resultRelInfo;
>
> Should the code for determining batch size access mtstate->resultRelInfo
> directly ?
>

IMO the issue is that code iterates over all plans and moves to the next
for each one:

     resultRelInfo++;

so it ends up pointing past the last element, hence the failures. So
yeah, either the code needs to move before the loop (per my patch), or
we need to access mtstate->resultRelInfo directly.

I'm pretty amazed this did not crash during any of the many regression
runs I did recently.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/21/21 2:02 AM, Zhihong Yu wrote:
> Hi, Tomas:
> In my opinion, my patch is a little better.
> Suppose one of the conditions in the if block changes in between the 
> start of loop and the end of the loop:
> 
>       * Determine if the FDW supports batch insert and determine the batch
>       * size (a FDW may support batching, but it may be disabled for the
>       * server/table).
> 
> My patch would reflect that change. I guess this was the reason the if / 
> else block was placed there in the first place.
> 

But can it change? All the loop does is extract junk attributes from 
the plans; it does not modify anything related to the batching. Or maybe 
I just don't understand what you mean.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Hi,
Do we need to consider how this part of the code inside ExecInitModifyTable() would evolve?

I think placing the compound condition toward the end of ExecInitModifyTable() is reasonable because it checks the latest information.

Regards

On Wed, Jan 20, 2021 at 5:11 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:


On 1/21/21 2:02 AM, Zhihong Yu wrote:
> Hi, Tomas:
> In my opinion, my patch is a little better.
> Suppose one of the conditions in the if block changes in between the
> start of loop and the end of the loop:
>
>       * Determine if the FDW supports batch insert and determine the batch
>       * size (a FDW may support batching, but it may be disabled for the
>       * server/table).
>
> My patch would reflect that change. I guess this was the reason the if /
> else block was placed there in the first place.
>

But can it change? All the loop does is extracting junk attributes from
the plans, it does not modify anything related to the batching. Or maybe
I just don't understand what you mean.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: POC: postgres_fdw insert batching

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> I may be wrong, but the most likely explanation seems to be this is due 
> to the junk filter initialization, which simply moves past the end of 
> the mtstate->resultRelInfo array.

resultRelInfo is certainly pointing at garbage at that point.

> It kinda seems the GetForeignModifyBatchSize call should happen before 
> that block. The attached patch fixes this for me (i.e. regression tests 
> pass with no valgrind reports.

> Or did I get that wrong?

Don't we need to initialize ri_BatchSize for *each* resultrelinfo,
not merely the first one?  That is, this new code needs to be
somewhere inside a loop over the result rels.

            regards, tom lane



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/21/21 1:17 AM, Zhihong Yu wrote:
> > Hi,
> > The assignment to resultRelInfo is done when junk_filter_needed is true:
> >
> >          if (junk_filter_needed)
> >          {
> >              resultRelInfo = mtstate->resultRelInfo;
> >
> > Should the code for determining batch size access mtstate->resultRelInfo
> > directly ?
> >
>
> IMO the issue is that code iterates over all plans and moves to the next
> for each one:
>
>      resultRelInfo++;
>
> so it ends up pointing past the last element, hence the failures. So
> yeah, either the code needs to move before the loop (per my patch), or
> we need to access mtstate->resultRelInfo directly.

Accessing mtstate->resultRelInfo directly would do.  The only
constraint on where this block should be placed is that
ri_projectReturning must be valid as of calling
GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread.
So, after this block in ExecInitModifyTable:

    /*
     * Initialize RETURNING projections if needed.
     */
    if (node->returningLists)
    {
        ....
        /*
         * Build a projection for each result rel.
         */
        resultRelInfo = mtstate->resultRelInfo;
        foreach(l, node->returningLists)
        {
            List       *rlist = (List *) lfirst(l);

            resultRelInfo->ri_returningList = rlist;
            resultRelInfo->ri_projectReturning =
                ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
                                        resultRelInfo->ri_RelationDesc->rd_att);
            resultRelInfo++;
        }
    }

-- 
Amit Langote
EDB: http://www.enterprisedb.com



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:

From: Zhihong Yu <zyu@yugabyte.com>

> Do we need to consider how this part of the code inside ExecInitModifyTable() would evolve?
>
> I think placing the compound condition toward the end of ExecInitModifyTable() is reasonable because it checks the latest information.

+1 for Zaihong-san's idea.  But instead of rewriting every resultRelInfo to mtstate->resultRelInfo, which makes it a bit harder to read, I'd like to suggest just adding "resultRelInfo = mtstate->resultRelInfo;" immediately before the if block.

Thanks a lot, all for helping to solve the problem quickly!

Regards

Takayuki Tsunakawa

 

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/21/21 2:24 AM, Amit Langote wrote:
> On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 1/21/21 1:17 AM, Zhihong Yu wrote:
>>> Hi,
>>> The assignment to resultRelInfo is done when junk_filter_needed is true:
>>>
>>>           if (junk_filter_needed)
>>>           {
>>>               resultRelInfo = mtstate->resultRelInfo;
>>>
>>> Should the code for determining batch size access mtstate->resultRelInfo
>>> directly ?
>>>
>>
>> IMO the issue is that code iterates over all plans and moves to the next
>> for each one:
>>
>>       resultRelInfo++;
>>
>> so it ends up pointing past the last element, hence the failures. So
>> yeah, either the code needs to move before the loop (per my patch), or
>> we need to access mtstate->resultRelInfo directly.
> 
> Accessing mtstate->resultRelInfo directly would do.  The only
> constraint on where this block should be placed is that
> ri_projectReturning must be valid as of calling
> GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread.
> So, after this block in ExecInitModifyTable:
> 
>      /*
>       * Initialize RETURNING projections if needed.
>       */
>      if (node->returningLists)
>      {
>          ....
>          /*
>           * Build a projection for each result rel.
>           */
>          resultRelInfo = mtstate->resultRelInfo;
>          foreach(l, node->returningLists)
>          {
>              List       *rlist = (List *) lfirst(l);
> 
>              resultRelInfo->ri_returningList = rlist;
>              resultRelInfo->ri_projectReturning =
>                  ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
>                                          resultRelInfo->ri_RelationDesc->rd_att);
>              resultRelInfo++;
>          }
>      }
> 

Right. But I think Tom is right this should initialize ri_BatchSize for 
all the resultRelInfo elements, not just the first one. Per the attached 
patch, which resolves the issue both on x86_64 and armv7l for me.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/21/21 2:22 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> I may be wrong, but the most likely explanation seems to be this is due
>> to the junk filter initialization, which simply moves past the end of
>> the mtstate->resultRelInfo array.
> 
> resultRelInfo is certainly pointing at garbage at that point.
> 

Yup. It's pretty amazing the x86 machines seem to be mostly OK with it.

>> It kinda seems the GetForeignModifyBatchSize call should happen before
>> that block. The attached patch fixes this for me (i.e. regression tests
>> pass with no valgrind reports.
> 
>> Or did I get that wrong?
> 
> Don't we need to initialize ri_BatchSize for *each* resultrelinfo,
> not merely the first one?  That is, this new code needs to be
> somewhere inside a loop over the result rels.
> 

Yeah, I think you're right. That's an embarrassing oversight :-(


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Hi, Takayuki-san:
My first name is Zhihong.

You can call me Ted if you want to save some typing :-)

Cheers

On Wed, Jan 20, 2021 at 5:37 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:

From: Zhihong Yu <zyu@yugabyte.com>

> Do we need to consider how this part of the code inside ExecInitModifyTable() would evolve?
>
> I think placing the compound condition toward the end of ExecInitModifyTable() is reasonable because it checks the latest information.

+1 for Zaihong-san's idea.  But instead of rewriting every resultRelInfo to mtstate->resultRelInfo, which makes it a bit harder to read, I'd like to suggest just adding "resultRelInfo = mtstate->resultRelInfo;" immediately before the if block.

Thanks a lot, all for helping to solve the problem quickly!

Regards
Takayuki Tsunakawa

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Right. But I think Tom is right this should initialize ri_BatchSize for all the
> resultRelInfo elements, not just the first one. Per the attached patch, which
> resolves the issue both on x86_64 and armv7l for me.

I think your patch is perfect in the sense that it's ready for future multi-target DML support.  +1

Just for learning, could anyone tell me what this loop is for?  I thought current Postgres's DML supports a single target
table, so it's enough to handle the first element of mtstate->resultRelInfo.  In that sense, Amit-san and I agreed that
we don't put the if block in the for loop yet.
 


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Jan 21, 2021 at 10:42 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/21/21 2:24 AM, Amit Langote wrote:
> > On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> On 1/21/21 1:17 AM, Zhihong Yu wrote:
> >>> Hi,
> >>> The assignment to resultRelInfo is done when junk_filter_needed is true:
> >>>
> >>>           if (junk_filter_needed)
> >>>           {
> >>>               resultRelInfo = mtstate->resultRelInfo;
> >>>
> >>> Should the code for determining batch size access mtstate->resultRelInfo
> >>> directly ?
> >>>
> >>
> >> IMO the issue is that code iterates over all plans and moves to the next
> >> for each one:
> >>
> >>       resultRelInfo++;
> >>
> >> so it ends up pointing past the last element, hence the failures. So
> >> yeah, either the code needs to move before the loop (per my patch), or
> >> we need to access mtstate->resultRelInfo directly.
> >
> > Accessing mtstate->resultRelInfo directly would do.  The only
> > constraint on where this block should be placed is that
> > ri_projectReturning must be valid as of calling
> > GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread.
> > So, after this block in ExecInitModifyTable:
> >
> >      /*
> >       * Initialize RETURNING projections if needed.
> >       */
> >      if (node->returningLists)
> >      {
> >          ....
> >          /*
> >           * Build a projection for each result rel.
> >           */
> >          resultRelInfo = mtstate->resultRelInfo;
> >          foreach(l, node->returningLists)
> >          {
> >              List       *rlist = (List *) lfirst(l);
> >
> >              resultRelInfo->ri_returningList = rlist;
> >              resultRelInfo->ri_projectReturning =
> >                  ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
> >                                          resultRelInfo->ri_RelationDesc->rd_att);
> >              resultRelInfo++;
> >          }
> >      }
> >
>
> Right. But I think Tom is right this should initialize ri_BatchSize for
> all the resultRelInfo elements, not just the first one. Per the attached
> patch, which resolves the issue both on x86_64 and armv7l for me.

+1 in general.  To avoid looping uselessly in the case of
UPDATE/DELETE where batching can't be used today, I'd suggest putting
if (operation == CMD_INSERT) around the loop.

-- 
Amit Langote
EDB: http://www.enterprisedb.com



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:

From: Zhihong Yu <zyu@yugabyte.com>
> My first name is Zhihong.
>
> You can call me Ted if you want to save some typing :-)

Ah, I'm very sorry.  Thank you, let me call you Ted then.  That can't be mistaken.

Regards
Takayuki Tsunakawa

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/21/21 2:53 AM, Amit Langote wrote:
> On Thu, Jan 21, 2021 at 10:42 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 1/21/21 2:24 AM, Amit Langote wrote:
>>> On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>> On 1/21/21 1:17 AM, Zhihong Yu wrote:
>>>>> Hi,
>>>>> The assignment to resultRelInfo is done when junk_filter_needed is true:
>>>>>
>>>>>            if (junk_filter_needed)
>>>>>            {
>>>>>                resultRelInfo = mtstate->resultRelInfo;
>>>>>
>>>>> Should the code for determining batch size access mtstate->resultRelInfo
>>>>> directly ?
>>>>>
>>>>
>>>> IMO the issue is that code iterates over all plans and moves to the next
>>>> for each one:
>>>>
>>>>        resultRelInfo++;
>>>>
>>>> so it ends up pointing past the last element, hence the failures. So
>>>> yeah, either the code needs to move before the loop (per my patch), or
>>>> we need to access mtstate->resultRelInfo directly.
>>>
>>> Accessing mtstate->resultRelInfo directly would do.  The only
>>> constraint on where this block should be placed is that
>>> ri_projectReturning must be valid as of calling
>>> GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread.
>>> So, after this block in ExecInitModifyTable:
>>>
>>>       /*
>>>        * Initialize RETURNING projections if needed.
>>>        */
>>>       if (node->returningLists)
>>>       {
>>>           ....
>>>           /*
>>>            * Build a projection for each result rel.
>>>            */
>>>           resultRelInfo = mtstate->resultRelInfo;
>>>           foreach(l, node->returningLists)
>>>           {
>>>               List       *rlist = (List *) lfirst(l);
>>>
>>>               resultRelInfo->ri_returningList = rlist;
>>>               resultRelInfo->ri_projectReturning =
>>>                   ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps,
>>>                                           resultRelInfo->ri_RelationDesc->rd_att);
>>>               resultRelInfo++;
>>>           }
>>>       }
>>>
>>
>> Right. But I think Tom is right this should initialize ri_BatchSize for
>> all the resultRelInfo elements, not just the first one. Per the attached
>> patch, which resolves the issue both on x86_64 and armv7l for me.
> 
> +1 in general.  To avoid looping uselessly in the case of
> UPDATE/DELETE where batching can't be used today, I'd suggest putting
> if (operation == CMD_INSERT) around the loop.
> 

Right, that's pretty much what I ended up doing (without the CMD_INSERT 
check it'd add batching info to explain for updates too, for example). 
I'll do a bit more testing on the attached patch, but I think that's the 
right fix to push.
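
Putting Tom's and Amit's points together, the shape of the fix is roughly the following (a sketch
only, not the text of the attached patch; the list iteration reuses the names from the forboth()
loop quoted earlier, and declarations are elided):

    /* in ExecInitModifyTable(), after the RETURNING projections are built */
    if (operation == CMD_INSERT)
    {
        resultRelInfo = mtstate->resultRelInfo;
        foreach(l, node->plans)        /* one result rel per subplan */
        {
            if (!resultRelInfo->ri_usesFdwDirectModify &&
                resultRelInfo->ri_FdwRoutine != NULL &&
                resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
                resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
                resultRelInfo->ri_BatchSize =
                    resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
            else
                resultRelInfo->ri_BatchSize = 1;

            Assert(resultRelInfo->ri_BatchSize >= 1);
            resultRelInfo++;
        }
    }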

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: POC: postgres_fdw insert batching

From
Tom Lane
Date:
"tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> writes:
> Just for learning, could anyone tell me what this loop is for?  I thought current Postgres's DML supports a single
> target table, so it's enough to handle the first element of mtstate->resultRelInfo.

The "single target table" could be partitioned, in which case there'll be
multiple resultrelinfos, some of which could be foreign tables.

            regards, tom lane



RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> Right, that's pretty much what I ended up doing (without the CMD_INSERT
> check it'd add batching info to explain for updates too, for example).
> I'll do a bit more testing on the attached patch, but I think that's the right fix to
> push.

Thanks to the outer check for operation ==  CMD_INSERT, the inner one became unnecessary.


Regards
Takayuki Tsunakawa


RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Tom Lane <tgl@sss.pgh.pa.us>
> The "single target table" could be partitioned, in which case there'll be
> multiple resultrelinfos, some of which could be foreign tables.

Thank you.  I thought so at first, but later I found that ExecInsert() only handles one element in
mtstate->resultRelInfo, so I thought just the first element is processed in the INSERT case.

I understood (guessed) that the for loop is for UPDATE and DELETE.  EXPLAIN without ANALYZE of an UPDATE/DELETE on a partitioned
table shows partitions, which would be mtstate->resultRelInfo.  EXPLAIN of an INSERT doesn't show partitions, so I think
INSERT will find the relevant partitions based on input rows during execution.


Regards
Takayuki Tsunakawa




Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote:
> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>> Right, that's pretty much what I ended up doing (without the CMD_INSERT
>> check it'd add batching info to explain for updates too, for example).
>> I'll do a bit more testing on the attached patch, but I think that's the right fix to
>> push.
> 
> Thanks to the outer check for operation ==  CMD_INSERT, the inner one became unnecessary.
> 

Right. I've pushed the fix, hopefully buildfarm will get happy again.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Ian Lawrence Barwick
Date:
Hi

2021年1月21日(木) 8:00 Tomas Vondra <tomas.vondra@enterprisedb.com>:
OK, pushed after a little bit of additional polishing (mostly comments).

Thanks everyone!

There's a minor typo in the doc's version of the ExecForeignBatchInsert() declaration;
is:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot *planSlots,
                      int *numSlots);

should be:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot **planSlots,
                      int *numSlots);

(Trivial patch attached).


Regards

Ian Barwick

--
Attachment

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote:
> > From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> >> Right, that's pretty much what I ended up doing (without the CMD_INSERT
> >> check it'd add batching info to explain for updates too, for example).
> >> I'll do a bit more testing on the attached patch, but I think that's the right fix to
> >> push.
> >
> > Thanks to the outer check for operation ==  CMD_INSERT, the inner one became unnecessary.
> >
>
> Right. I've pushed the fix, hopefully buildfarm will get happy again.

I was looking at this and it looks like we've got a problematic case
where postgresGetForeignModifyBatchSize() is called from
ExecInitRoutingInfo().

That case is when the insert is performed as part of a cross-partition
update of a partitioned table containing postgres_fdw foreign table
partitions.  Because we don't check the operation in
ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such
inserts attempt to use batching.  However the ResultRelInfo may be one
for the original update operation, so ri_FdwState contains a
PgFdwModifyState with batch_size set to 0, because updates don't
support batching.  As things stand now,
postgresGetForeignModifyBatchSize() simply returns that, tripping the
following Assert in the caller.

Assert(partRelInfo->ri_BatchSize >= 1);

Use this example to see the crash:

create table p (a int) partition by list (a);
create table p1 (like p);
create extension postgres_fdw;
create server lb foreign data wrapper postgres_fdw ;
create user mapping for current_user server lb;
create foreign table fp1 (a int) server lb options (table_name 'p1');
alter table p attach partition fp1 for values in (1);
create or replace function report_trig_details() returns trigger as $$
begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
'DELETE' then return old; end if; return new; end; $$ language
plpgsql;
create trigger trig before update on fp1 for each row execute function
report_trig_details();
create table p2 partition of p for values in (2);
insert into p values (2);
update p set a = 1;  -- crashes

So let's check mtstate->operation == CMD_INSERT in
ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize()
in cross-partition update situations where mtstate->operation would be
CMD_UPDATE.
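
That is, the guard in ExecInitRoutingInfo() would look something like this (a sketch of the idea,
using the field and callback names already discussed, not the attached patch itself):

    /*
     * Only ask the FDW for a batch size when the tuple routing is being done
     * for an INSERT; for a cross-partition UPDATE the FDW state belongs to
     * the update, and batching is not supported there.
     */
    if (mtstate->operation == CMD_INSERT &&
        partRelInfo->ri_FdwRoutine != NULL &&
        partRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
        partRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
        partRelInfo->ri_BatchSize =
            partRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(partRelInfo);
    else
        partRelInfo->ri_BatchSize = 1;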

I've attached a patch.




--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: POC: postgres_fdw insert batching

From
Zhihong Yu
Date:
Amit:
Good catch.

bq. ExecInitRoutingInfo() that is in the charge of initialing

Should be 'ExecInitRoutingInfo() that is in charge of initializing'

Cheers

On Sat, Jan 23, 2021 at 12:31 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote:
> > From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> >> Right, that's pretty much what I ended up doing (without the CMD_INSERT
> >> check it'd add batching info to explain for updates too, for example).
> >> I'll do a bit more testing on the attached patch, but I think that's the right fix to
> >> push.
> >
> > Thanks to the outer check for operation ==  CMD_INSERT, the inner one became unnecessary.
> >
>
> Right. I've pushed the fix, hopefully buildfarm will get happy again.

I was looking at this and it looks like we've got a problematic case
where postgresGetForeignModifyBatchSize() is called from
ExecInitRoutingInfo().

That case is when the insert is performed as part of a cross-partition
update of a partitioned table containing postgres_fdw foreign table
partitions.  Because we don't check the operation in
ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such
inserts attempt to use batching.  However the ResultRelInfo may be one
for the original update operation, so ri_FdwState contains a
PgFdwModifyState with batch_size set to 0, because updates don't
support batching.  As things stand now,
postgresGetForeignModifyBatchSize() simply returns that, tripping the
following Assert in the caller.

Assert(partRelInfo->ri_BatchSize >= 1);

Use this example to see the crash:

create table p (a int) partition by list (a);
create table p1 (like p);
create extension postgres_fdw;
create server lb foreign data wrapper postgres_fdw ;
create user mapping for current_user server lb;
create foreign table fp1 (a int) server lb options (table_name 'p1');
alter table p attach partition fp1 for values in (1);
create or replace function report_trig_details() returns trigger as $$
begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
'DELETE' then return old; end if; return new; end; $$ language
plpgsql;
create trigger trig before update on fp1 for each row execute function
report_trig_details();
create table p2 partition of p for values in (2);
insert into p values (2);
update p set a = 1;  -- crashes

So we let's check mtstate->operation == CMD_INSERT in
ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize()
in cross-update situations where mtstate->operation would be
CMD_UPDATE.

I've attached a patch.




--
Amit Langote
EDB: http://www.enterprisedb.com

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 1/23/21 9:31 AM, Amit Langote wrote:
> On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote:
>>> From: Tomas Vondra <tomas.vondra@enterprisedb.com>
>>>> Right, that's pretty much what I ended up doing (without the CMD_INSERT
>>>> check it'd add batching info to explain for updates too, for example).
>>>> I'll do a bit more testing on the attached patch, but I think that's the right fix to
>>>> push.
>>>
>>> Thanks to the outer check for operation ==  CMD_INSERT, the inner one became unnecessary.
>>>
>>
>> Right. I've pushed the fix, hopefully buildfarm will get happy again.
> 
> I was looking at this and it looks like we've got a problematic case
> where postgresGetForeignModifyBatchSize() is called from
> ExecInitRoutingInfo().
> 
> That case is when the insert is performed as part of a cross-partition
> update of a partitioned table containing postgres_fdw foreign table
> partitions.  Because we don't check the operation in
> ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such
> inserts attempt to use batching.  However the ResultRelInfo may be one
> for the original update operation, so ri_FdwState contains a
> PgFdwModifyState with batch_size set to 0, because updates don't
> support batching.  As things stand now,
> postgresGetForeignModifyBatchSize() simply returns that, tripping the
> following Assert in the caller.
> 
> Assert(partRelInfo->ri_BatchSize >= 1);
> 
> Use this example to see the crash:
> 
> create table p (a int) partition by list (a);
> create table p1 (like p);
> create extension postgres_fdw;
> create server lb foreign data wrapper postgres_fdw ;
> create user mapping for current_user server lb;
> create foreign table fp1 (a int) server lb options (table_name 'p1');
> alter table p attach partition fp1 for values in (1);
> create or replace function report_trig_details() returns trigger as $$
> begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
> 'DELETE' then return old; end if; return new; end; $$ language
> plpgsql;
> create trigger trig before update on fp1 for each row execute function
> report_trig_details();
> create table p2 partition of p for values in (2);
> insert into p values (2);
> update p set a = 1;  -- crashes
> 
> So we let's check mtstate->operation == CMD_INSERT in
> ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize()
> in cross-update situations where mtstate->operation would be
> CMD_UPDATE.
> 
> I've attached a patch.
> 

Thanks for catching this. I think it'd be good if the fix included a 
regression test. The example seems like a good starting point, not sure 
if it can be simplified further.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Sun, Jan 24, 2021 at 2:17 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/23/21 9:31 AM, Amit Langote wrote:
> > I was looking at this and it looks like we've got a problematic case
> > where postgresGetForeignModifyBatchSize() is called from
> > ExecInitRoutingInfo().
> >
> > That case is when the insert is performed as part of a cross-partition
> > update of a partitioned table containing postgres_fdw foreign table
> > partitions.  Because we don't check the operation in
> > ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such
> > inserts attempt to use batching.  However the ResultRelInfo may be one
> > for the original update operation, so ri_FdwState contains a
> > PgFdwModifyState with batch_size set to 0, because updates don't
> > support batching.  As things stand now,
> > postgresGetForeignModifyBatchSize() simply returns that, tripping the
> > following Assert in the caller.
> >
> > Assert(partRelInfo->ri_BatchSize >= 1);
> >
> > Use this example to see the crash:
> >
> > create table p (a int) partition by list (a);
> > create table p1 (like p);
> > create extension postgres_fdw;
> > create server lb foreign data wrapper postgres_fdw ;
> > create user mapping for current_user server lb;
> > create foreign table fp1 (a int) server lb options (table_name 'p1');
> > alter table p attach partition fp1 for values in (1);
> > create or replace function report_trig_details() returns trigger as $$
> > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
> > 'DELETE' then return old; end if; return new; end; $$ language
> > plpgsql;
> > create trigger trig before update on fp1 for each row execute function
> > report_trig_details();
> > create table p2 partition of p for values in (2);
> > insert into p values (2);
> > update p set a = 1;  -- crashes
> >
> > So we let's check mtstate->operation == CMD_INSERT in
> > ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize()
> > in cross-update situations where mtstate->operation would be
> > CMD_UPDATE.
> >
> > I've attached a patch.
>
> Thanks for catching this. I think it'd be good if the fix included a
> regression test. The example seems like a good starting point, not sure
> if it can be simplified further.

Yes, it can be simplified by using a local join to prevent the update
of the foreign partition from being pushed to the remote server, for
which my example in the previous email used a local trigger.  Note
that the update of the foreign partition to be done locally is a
prerequisite for this bug to occur.

I've added that simplified test case in the attached updated patch.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Amit Langote <amitlangote09@gmail.com>
> Yes, it can be simplified by using a local join to prevent the update of the foreign
> partition from being pushed to the remote server, for which my example in the
> previous email used a local trigger.  Note that the update of the foreign
> partition to be done locally is a prerequisite for this bug to occur.

Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway through.  Good catch (and my bad for missing it).


+    PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
+                            (PgFdwModifyState *) resultRelInfo->ri_FdwState :
+                            NULL;

This can be written as:

+    PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Ian Lawrence Barwick
Date:


On Fri, Jan 22, 2021 at 14:50 Ian Lawrence Barwick <barwick@gmail.com> wrote:
Hi

On Thu, Jan 21, 2021 at 8:00 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
OK, pushed after a little bit of additional polishing (mostly comments).

Thanks everyone!

There's a minor typo in the doc's version of the ExecForeignBatchInsert() declaration;
is:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot *planSlots,
                      int *numSlots);

should be:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot **planSlots,
                      int *numSlots);

(Trivial patch attached).


Regards

Ian Barwick


Re: POC: postgres_fdw insert batching

From
Ian Lawrence Barwick
Date:


On Fri, Jan 22, 2021 at 14:50 Ian Lawrence Barwick <barwick@gmail.com> wrote:
Hi

On Thu, Jan 21, 2021 at 8:00 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
OK, pushed after a little bit of additional polishing (mostly comments).

Thanks everyone!

There's a minor typo in the doc's version of the ExecForeignBatchInsert() declaration;
is:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot *planSlots,
                      int *numSlots);

should be:

    TupleTableSlot **
    ExecForeignBatchInsert(EState *estate,
                      ResultRelInfo *rinfo,
                      TupleTableSlot **slots,
                      TupleTableSlot **planSlots,
                      int *numSlots);

(Trivial patch attached).


Forgot to mention the relevant doc link:

https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE
Regards

Ian Barwick

--

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
Tsunakawa-san,

On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> From: Amit Langote <amitlangote09@gmail.com>
> > Yes, it can be simplified by using a local join to prevent the update of the foreign
> > partition from being pushed to the remote server, for which my example in the
> > previous email used a local trigger.  Note that the update of the foreign
> > partition to be done locally is a prerequisite for this bug to occur.
>
> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway.  Good catch (and my bad miss.)

It appears I had missed your reply, sorry.

> +       PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
> +                                                       (PgFdwModifyState *) resultRelInfo->ri_FdwState :
> +                                                       NULL;
>
> This can be written as:
>
> +       PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

Facepalm, yes.

Patch updated.  Thanks for the review.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

RE: POC: postgres_fdw insert batching

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Amit Langote <amitlangote09@gmail.com>
> It appears I had missed your reply, sorry.
> 
> > +       PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
> > +
> (PgFdwModifyState *) resultRelInfo->ri_FdwState :
> > +                                                       NULL;
> >
> > This can be written as:
> >
> > +       PgFdwModifyState *fmstate = (PgFdwModifyState *)
> > + resultRelInfo->ri_FdwState;
> 
> Facepalm, yes.
> 
> Patch updated.  Thanks for the review.

Thank you for picking this up.


Regards
Takayuki Tsunakawa


Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 2/5/21 2:55 AM, Ian Lawrence Barwick wrote:
> ...
> 
>     There's a minor typo in the doc's version of the
>     ExecForeignBatchInsert() declaration;
>     is:
> 
>         TupleTableSlot **
>         ExecForeignBatchInsert(EState *estate,
>                           ResultRelInfo *rinfo,
>                           TupleTableSlot **slots,
>                           TupleTableSlot *planSlots,
>                           int *numSlots);
> 
>     should be:
> 
>         TupleTableSlot **
>         ExecForeignBatchInsert(EState *estate,
>                           ResultRelInfo *rinfo,
>                           TupleTableSlot **slots,
>                           TupleTableSlot **planSlots,
>                           int *numSlots);
> 
>     (Trivial patch attached).
> 
> 
> Forgot to mention the relevant doc link:
> 
>    
> https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE
> 

Thanks, I'll get this fixed.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 2/5/21 3:52 AM, Amit Langote wrote:
> Tsunakwa-san,
> 
> On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
>> From: Amit Langote <amitlangote09@gmail.com>
>>> Yes, it can be simplified by using a local join to prevent the update of the foreign
>>> partition from being pushed to the remote server, for which my example in the
>>> previous email used a local trigger.  Note that the update of the foreign
>>> partition to be done locally is a prerequisite for this bug to occur.
>>
>> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway.  Good catch (and my bad miss.)
> 
> It appears I had missed your reply, sorry.
> 
>> +       PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
>> +                                                       (PgFdwModifyState *) resultRelInfo->ri_FdwState :
>> +                                                       NULL;
>>
>> This can be written as:
>>
>> +       PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;
> 
> Facepalm, yes.
> 
> Patch updated.  Thanks for the review.
> 

Thanks for the patch, it seems fine to me. I wonder if the commit 
message needs some tweaks, though. At the moment it says:

    Prevent FDW insert batching during cross-partition updates

but what the patch seems to be doing is simply initializing the info
only for CMD_INSERT operations. Which does the trick, but it affects
everything, i.e. all updates, no? Not just cross-partition updates.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 2/5/21 3:52 AM, Amit Langote wrote:
> > Tsunakwa-san,
> >
> > On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> >> From: Amit Langote <amitlangote09@gmail.com>
> >>> Yes, it can be simplified by using a local join to prevent the update of the foreign
> >>> partition from being pushed to the remote server, for which my example in the
> >>> previous email used a local trigger.  Note that the update of the foreign
> >>> partition to be done locally is a prerequisite for this bug to occur.
> >>
> >> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway.  Good catch (and my bad miss.)
> >
> > It appears I had missed your reply, sorry.
> >
> >> +       PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
> >> +                                                       (PgFdwModifyState *) resultRelInfo->ri_FdwState :
> >> +                                                       NULL;
> >>
> >> This can be written as:
> >>
> >> +       PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;
> >
> > Facepalm, yes.
> >
> > Patch updated.  Thanks for the review.
> >
>
> Thanks for the patch, it seems fine to me.

Thanks for checking.

> I wonder it the commit
> message needs some tweaks, though. At the moment it says:
>
>     Prevent FDW insert batching during cross-partition updates
>
> but what the patch seems to be doing is simply initializing the info
> only for CMD_INSERT operations. Which does the trick, but it affects
> everything, i.e. all updates, no? Not just cross-partition updates.

You're right.  Please check the message in the updated patch.


--
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:

On 2/16/21 10:25 AM, Amit Langote wrote:
> On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 2/5/21 3:52 AM, Amit Langote wrote:
>>> Tsunakwa-san,
>>>
>>> On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com
>>> <tsunakawa.takay@fujitsu.com> wrote:
>>>> From: Amit Langote <amitlangote09@gmail.com>
>>>>> Yes, it can be simplified by using a local join to prevent the update of the foreign
>>>>> partition from being pushed to the remote server, for which my example in the
>>>>> previous email used a local trigger.  Note that the update of the foreign
>>>>> partition to be done locally is a prerequisite for this bug to occur.
>>>>
>>>> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway.  Good catch (and my bad miss.)
>>>
>>> It appears I had missed your reply, sorry.
>>>
>>>> +       PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
>>>> +                                                       (PgFdwModifyState *) resultRelInfo->ri_FdwState :
>>>> +                                                       NULL;
>>>>
>>>> This can be written as:
>>>>
>>>> +       PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;
>>>
>>> Facepalm, yes.
>>>
>>> Patch updated.  Thanks for the review.
>>>
>>
>> Thanks for the patch, it seems fine to me.
> 
> Thanks for checking.
> 
>> I wonder it the commit
>> message needs some tweaks, though. At the moment it says:
>>
>>      Prevent FDW insert batching during cross-partition updates
>>
>> but what the patch seems to be doing is simply initializing the info
>> only for CMD_INSERT operations. Which does the trick, but it affects
>> everything, i.e. all updates, no? Not just cross-partition updates.
> 
> You're right.  Please check the message in the updated patch.
>

Thanks. I'm not sure I understand what "FDW may not be able to handle 
both the original update operation and the batched insert operation 
being performed at the same time" means. I mean, if we translate the 
UPDATE into DELETE+INSERT, then we don't run both the update and insert 
at the same time, right? What exactly is the problem with allowing 
batching for inserts in cross-partition updates?

On a closer look, it seems the problem actually lies in a small 
inconsistency between create_foreign_modify and ExecInitRoutingInfo. The 
former only sets batch_size for CMD_INSERT, while the latter calls 
GetForeignModifyBatchSize() for all operations, expecting a result >= 1. 
So we may either relax create_foreign_modify and set batch_size for all 
DML, or make ExecInitRoutingInfo stricter (which is what the patches here do).
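
To spell out the mismatch, create_foreign_modify only initializes the
batch size for inserts, roughly like this (a simplified sketch, not the
exact code; get_batch_size_option() is the helper that reads the
batch_size option):

    /* create_foreign_modify(): batching is only set up for INSERT */
    if (operation == CMD_INSERT)
        fmstate->batch_size = get_batch_size_option(rel);
    /* for UPDATE/DELETE, fmstate->batch_size stays 0 */

while postgresGetForeignModifyBatchSize() simply hands that value back,
so calling it for anything other than an INSERT yields 0 and trips the
Assert in ExecInitRoutingInfo().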

Is there a reason not to do the first thing, allowing batching of 
inserts during cross-partition updates? I tried to do that, but it 
dawned on me that we can't mix batched and un-batched operations, e.g. 
DELETE + INSERT, because that'd break the order of execution, leading to 
bogus results in case the same row is modified repeatedly, etc.

Am I getting this right?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Wed, Feb 17, 2021 at 12:04 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> > On 2/16/21 10:25 AM, Amit Langote wrote:
> > > On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra
> > >> Thanks for the patch, it seems fine to me.
> > >
> > > Thanks for checking.
> > >
> > >> I wonder it the commit
> > >> message needs some tweaks, though. At the moment it says:
> > >>
> > >>      Prevent FDW insert batching during cross-partition updates
> > >>
> > >> but what the patch seems to be doing is simply initializing the info
> > >> only for CMD_INSERT operations. Which does the trick, but it affects
> > >> everything, i.e. all updates, no? Not just cross-partition updates.
> > >
> > > You're right.  Please check the message in the updated patch.
> >
> > Thanks. I'm not sure I understand what "FDW may not be able to handle
> > both the original update operation and the batched insert operation
> > being performed at the same time" means. I mean, if we translate the
> > UPDATE into DELETE+INSERT, then we don't run both the update and insert
> > at the same time, right? What exactly is the problem with allowing
> > batching for inserts in cross-partition updates?
>
> Sorry, I hadn't shared enough details of my investigations when I
> originally ran into this.  Such as that I had considered implementing
> the use of batching for these inserts too but had given up.
>
> Now that you mention it, I think I gave a less convincing reason for
> why we should avoid doing it at all.  Maybe it would have been more
> right to say that it is the core code, not necessarily the FDWs, that
> currently fails to deal with the use of batching by the insert
> component of a cross-partition update.  Those failures could be
> addressed as I'll describe below.
>
> For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught
> to simply use the PgFdwModifyTable that is installed to handle the
> insert component of a cross-partition update (one can get that one via
> aux_fmstate field of the original PgFdwModifyState).  However, even
> though that's fine for postgres_fdw to do, what worries (had worried)
> me is that it also results in scribbling on ri_BatchSize that the core
> code may see to determine what to do with a particular tuple, and I
> just have to hope that nodeModifyTable.c doesn't end up doing anything
> unwarranted with the original update based on seeing a non-zero
> ri_BatchSize.  AFAICS, we are fine on that front.
>
> That said, there are some deficiencies in the code that have to be
> addressed before we can let postgres_fdw do as mentioned above.  For
> example, the code in ExecModifyTable() that runs after breaking out of
> the loop to insert any remaining batched tuples appears to miss the
> tuples batched by such inserts.   Apparently, that is because the
> ResultRelInfos used by those inserts are not present in
> es_tuple_routing_result_relations.  Turns out I had forgotten that
> execPartition.c doesn't add the ResultRelInfos to that list if they
> are made by ExecInitModifyTable() for the original update operation
> and simply reused by ExecFindPartition() when tuples were routed to
> those partitions.  It can be "fixed" by reverting to the original
> design in Tsunakawa-san's patch where the tuple routing result
> relations were obtained from the PartitionTupleRouting data structure,
> which fortunately stores all tuple routing result relations.  (Sorry,
> I gave wrong advice in [1] in retrospect.)
>
> > On a closer look, it seems the problem actually lies in a small
> > inconsistency between create_foreign_modify and ExecInitRoutingInfo. The
> > former only set batch_size for CMD_INSERT while the latter called the
> > BatchSize() for all operations, expecting >= 1 result. So we may either
> > relax create_foreign_modify and set batch_size for all DML, or make
> > ExecInitRoutingInfo stricter (which is what the patches here do).
>
> I think we should be fine if we make
> postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState
> as described above.  We can be sure that we are not mixing the
> information used by the batched insert with that of the original
> unbatched update.
>
> > Is there a reason not to do the first thing, allowing batching of
> > inserts during cross-partition updates? I tried to do that, but it
> > dawned on me that we can't mix batched and un-batched operations, e.g.
> > DELETE + INSERT, because that'd break the order of execution, leading to
> > bogus results in case the same row is modified repeatedly, etc.
>
> Actually, postgres_fdw only supports moving a row into a partition (as
> part of a cross-partition update that is) if it has already finished
> performing any updates on it.   So there is no worry of rows that are
> moved into a partition subsequently getting updated due to the
> original command.
>
> The attached patch implements the changes necessary to make these
> inserts use batching too.
>
> [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com

Oops, I had mistakenly not hit "Reply All".  Attaching the patch again.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 2/17/21 9:51 AM, Amit Langote wrote:
> On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
>> On Wed, Feb 17, 2021 at 12:04 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>> On 2/16/21 10:25 AM, Amit Langote wrote:
>>>> On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra
>>>>> Thanks for the patch, it seems fine to me.
>>>>
>>>> Thanks for checking.
>>>>
>>>>> I wonder it the commit
>>>>> message needs some tweaks, though. At the moment it says:
>>>>>
>>>>>      Prevent FDW insert batching during cross-partition updates
>>>>>
>>>>> but what the patch seems to be doing is simply initializing the info
>>>>> only for CMD_INSERT operations. Which does the trick, but it affects
>>>>> everything, i.e. all updates, no? Not just cross-partition updates.
>>>>
>>>> You're right.  Please check the message in the updated patch.
>>>
>>> Thanks. I'm not sure I understand what "FDW may not be able to handle
>>> both the original update operation and the batched insert operation
>>> being performed at the same time" means. I mean, if we translate the
>>> UPDATE into DELETE+INSERT, then we don't run both the update and insert
>>> at the same time, right? What exactly is the problem with allowing
>>> batching for inserts in cross-partition updates?
>>
>> Sorry, I hadn't shared enough details of my investigations when I
>> originally ran into this.  Such as that I had considered implementing
>> the use of batching for these inserts too but had given up.
>>
>> Now that you mention it, I think I gave a less convincing reason for
>> why we should avoid doing it at all.  Maybe it would have been more
>> right to say that it is the core code, not necessarily the FDWs, that
>> currently fails to deal with the use of batching by the insert
>> component of a cross-partition update.  Those failures could be
>> addressed as I'll describe below.
>>
>> For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught
>> to simply use the PgFdwModifyTable that is installed to handle the
>> insert component of a cross-partition update (one can get that one via
>> aux_fmstate field of the original PgFdwModifyState).  However, even
>> though that's fine for postgres_fdw to do, what worries (had worried)
>> me is that it also results in scribbling on ri_BatchSize that the core
>> code may see to determine what to do with a particular tuple, and I
>> just have to hope that nodeModifyTable.c doesn't end up doing anything
>> unwarranted with the original update based on seeing a non-zero
>> ri_BatchSize.  AFAICS, we are fine on that front.
>>
>> That said, there are some deficiencies in the code that have to be
>> addressed before we can let postgres_fdw do as mentioned above.  For
>> example, the code in ExecModifyTable() that runs after breaking out of
>> the loop to insert any remaining batched tuples appears to miss the
>> tuples batched by such inserts.   Apparently, that is because the
>> ResultRelInfos used by those inserts are not present in
>> es_tuple_routing_result_relations.  Turns out I had forgotten that
>> execPartition.c doesn't add the ResultRelInfos to that list if they
>> are made by ExecInitModifyTable() for the original update operation
>> and simply reused by ExecFindPartition() when tuples were routed to
>> those partitions.  It can be "fixed" by reverting to the original
>> design in Tsunakawa-san's patch where the tuple routing result
>> relations were obtained from the PartitionTupleRouting data structure,
>> which fortunately stores all tuple routing result relations.  (Sorry,
>> I gave wrong advice in [1] in retrospect.)
>>
>>> On a closer look, it seems the problem actually lies in a small
>>> inconsistency between create_foreign_modify and ExecInitRoutingInfo. The
>>> former only set batch_size for CMD_INSERT while the latter called the
>>> BatchSize() for all operations, expecting >= 1 result. So we may either
>>> relax create_foreign_modify and set batch_size for all DML, or make
>>> ExecInitRoutingInfo stricter (which is what the patches here do).
>>
>> I think we should be fine if we make
>> postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState
>> as described above.  We can be sure that we are not mixing the
>> information used by the batched insert with that of the original
>> unbatched update.
>>
>>> Is there a reason not to do the first thing, allowing batching of
>>> inserts during cross-partition updates? I tried to do that, but it
>>> dawned on me that we can't mix batched and un-batched operations, e.g.
>>> DELETE + INSERT, because that'd break the order of execution, leading to
>>> bogus results in case the same row is modified repeatedly, etc.
>>
>> Actually, postgres_fdw only supports moving a row into a partition (as
>> part of a cross-partition update that is) if it has already finished
>> performing any updates on it.   So there is no worry of rows that are
>> moved into a partition subsequently getting updated due to the
>> original command.
>>
>> The attached patch implements the changes necessary to make these
>> inserts use batching too.
>>
>> [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com
> 
> Oops, I had mistakenly not hit "Reply All".  Attaching the patch again.
> 

Thanks. The patch seems reasonable, but it's a bit too large for a fix,
so I'll go ahead and push one of the previous fixes restricting batching
to plain INSERT commands. But this seems useful, so I suggest adding it
to the next commitfest.

One thing that surprised me is that we only move the rows *to* the
foreign partition, not from it (even on pg13, i.e. before the batching
etc.). I mean, using the example you posted earlier, with one foreign
and one local partition, consider this:

  delete from p;
  insert into p values (2);

  test=# select * from p2;
   a
  ---
   2
  (1 row)

  test=# update p set a = 1;
  UPDATE 1

  test=# select * from p1;
   a
  ---
   1
  (1 row)

OK, so it was moved to the foreign partition, which is for rows with
value in (1). So far so good. Let's do another update:

  test=# update p set a = 2;
  UPDATE 1
  test=# select * from p1;
   a
  ---
   2
  (1 row)

So now the foreign partition contains the value (2), which is wrong with
respect to the partitioning rules - this row should be in p2, the local
partition. This causes a pretty annoying issue:

test=# explain analyze select * from p where a = 2;

                            QUERY PLAN
  ---------------------------------------------------------------
   Seq Scan on p2 p  (cost=0.00..41.88 rows=13 width=4)
                     (actual  time=0.024..0.028 rows=0 loops=1)
     Filter: (a = 2)
   Planning Time: 0.355 ms
   Execution Time: 0.089 ms
  (4 rows)

That is, we fail to find the row, because partition pruning eliminates
the partition that actually holds it.

Now, maybe this is expected, but it seems like a rather mind-boggling
violation of the POLA principle. I've checked whether postgres_fdw mentions this
somewhere, but all I see about row movement is this:

   Note also that postgres_fdw supports row movement invoked by UPDATE
   statements executed on partitioned tables, but it currently does not
   handle the case where a remote partition chosen to insert a moved row
   into is also an UPDATE target partition that will be updated later.

and if I understand that correctly, that's about something different.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Tomas Vondra
Date:
On 2/17/21 8:36 PM, Tomas Vondra wrote:
>
> ...
> 
> Thanks. The patch seems reasonable, but it's a bit too large for a fix,
> so I'll go ahead and push one of the previous fixes restricting batching
> to plain INSERT commands. But this seems useful, so I suggest adding it
> to the next commitfest.

I've pushed the v4 fix, adding the CMD_INSERT check to execPartition.c.

I think we may need to revise the relationship between FDW and places
that (may) call GetForeignModifyBatchSize. Currently, these places need
to be quite well synchronized - in a way, the issue was (partially) due
to postgres_fdw and core not agreeing on the details.

In particular, create_foreign_modify sets batch_size for CMD_INSERT and
leaves it 0 otherwise. So GetForeignModifyBatchSize() returned 0 later,
triggering an assert.

It's probably better to require GetForeignModifyBatchSize() to always
return a valid batch size (>= 1). If batching is not supported, just
return 1.
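
For postgres_fdw, a defensive version might look roughly like this (only
a sketch; the exact conditions under which batching has to be disabled
are illustrative, and it assumes the existing get_batch_size_option()
helper):

    static int
    postgresGetForeignModifyBatchSize(ResultRelInfo *resultRelInfo)
    {
        int         batch_size;
        PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

        /* Use the value computed in create_foreign_modify(), if any */
        if (fmstate)
            batch_size = fmstate->batch_size;
        else
            batch_size = get_batch_size_option(resultRelInfo->ri_RelationDesc);

        /*
         * Batching can't be used with RETURNING or AFTER ROW triggers,
         * but report 1 rather than 0 in that case, so the caller always
         * gets a valid (>= 1) batch size.
         */
        if (resultRelInfo->ri_projectReturning != NULL ||
            (resultRelInfo->ri_TrigDesc &&
             resultRelInfo->ri_TrigDesc->trig_insert_after_row))
            return 1;

        return Max(batch_size, 1);
    }

That way the caller never sees 0, regardless of how ri_FdwState was set up.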


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Feb 18, 2021 at 8:38 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 2/17/21 8:36 PM, Tomas Vondra wrote:
> > Thanks. The patch seems reasonable, but it's a bit too large for a fix,
> > so I'll go ahead and push one of the previous fixes restricting batching
> > to plain INSERT commands. But this seems useful, so I suggest adding it
> > to the next commitfest.
>
> I've pushed the v4 fix, adding the CMD_INSERT to execPartition.
>
> I think we may need to revise the relationship between FDW and places
> that (may) call GetForeignModifyBatchSize. Currently, these places need
> to be quite well synchronized - in a way, the issue was (partially) due
> to postgres_fdw and core not agreeing on the details.

Agreed.

> In particular, create_foreign_modify sets batch_size for CMD_INSERT and
> leaves it 0 otherwise. So GetForeignModifyBatchSize() returned 0 later,
> triggering an assert.
>
> It's probably better to require GetForeignModifyBatchSize() to always
> return a valid batch size (>= 1). If batching is not supported, just
> return 1.

That makes sense.

How about the attached?

-- 
Amit Langote
EDB: http://www.enterprisedb.com

Attachment

Re: POC: postgres_fdw insert batching

From
Amit Langote
Date:
On Thu, Feb 18, 2021 at 4:36 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 2/17/21 9:51 AM, Amit Langote wrote:
> > On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >> Sorry, I hadn't shared enough details of my investigations when I
> >> originally ran into this.  Such as that I had considered implementing
> >> the use of batching for these inserts too but had given up.
> >>
> >> Now that you mention it, I think I gave a less convincing reason for
> >> why we should avoid doing it at all.  Maybe it would have been more
> >> right to say that it is the core code, not necessarily the FDWs, that
> >> currently fails to deal with the use of batching by the insert
> >> component of a cross-partition update.  Those failures could be
> >> addressed as I'll describe below.
> >>
> >> For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught
> >> to simply use the PgFdwModifyTable that is installed to handle the
> >> insert component of a cross-partition update (one can get that one via
> >> aux_fmstate field of the original PgFdwModifyState).  However, even
> >> though that's fine for postgres_fdw to do, what worries (had worried)
> >> me is that it also results in scribbling on ri_BatchSize that the core
> >> code may see to determine what to do with a particular tuple, and I
> >> just have to hope that nodeModifyTable.c doesn't end up doing anything
> >> unwarranted with the original update based on seeing a non-zero
> >> ri_BatchSize.  AFAICS, we are fine on that front.
> >>
> >> That said, there are some deficiencies in the code that have to be
> >> addressed before we can let postgres_fdw do as mentioned above.  For
> >> example, the code in ExecModifyTable() that runs after breaking out of
> >> the loop to insert any remaining batched tuples appears to miss the
> >> tuples batched by such inserts.   Apparently, that is because the
> >> ResultRelInfos used by those inserts are not present in
> >> es_tuple_routing_result_relations.  Turns out I had forgotten that
> >> execPartition.c doesn't add the ResultRelInfos to that list if they
> >> are made by ExecInitModifyTable() for the original update operation
> >> and simply reused by ExecFindPartition() when tuples were routed to
> >> those partitions.  It can be "fixed" by reverting to the original
> >> design in Tsunakawa-san's patch where the tuple routing result
> >> relations were obtained from the PartitionTupleRouting data structure,
> >> which fortunately stores all tuple routing result relations.  (Sorry,
> >> I gave wrong advice in [1] in retrospect.)
> >>
> >>> On a closer look, it seems the problem actually lies in a small
> >>> inconsistency between create_foreign_modify and ExecInitRoutingInfo. The
> >>> former only set batch_size for CMD_INSERT while the latter called the
> >>> BatchSize() for all operations, expecting >= 1 result. So we may either
> >>> relax create_foreign_modify and set batch_size for all DML, or make
> >>> ExecInitRoutingInfo stricter (which is what the patches here do).
> >>
> >> I think we should be fine if we make
> >> postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState
> >> as described above.  We can be sure that we are not mixing the
> >> information used by the batched insert with that of the original
> >> unbatched update.
> >>
> >>> Is there a reason not to do the first thing, allowing batching of
> >>> inserts during cross-partition updates? I tried to do that, but it
> >>> dawned on me that we can't mix batched and un-batched operations, e.g.
> >>> DELETE + INSERT, because that'd break the order of execution, leading to
> >>> bogus results in case the same row is modified repeatedly, etc.
> >>
> >> Actually, postgres_fdw only supports moving a row into a partition (as
> >> part of a cross-partition update that is) if it has already finished
> >> performing any updates on it.   So there is no worry of rows that are
> >> moved into a partition subsequently getting updated due to the
> >> original command.
> >>
> >> The attached patch implements the changes necessary to make these
> >> inserts use batching too.
> >>
> >> [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com
> >
> > Oops, I had mistakenly not hit "Reply All".  Attaching the patch again.
> >
>
> Thanks. The patch seems reasonable, but it's a bit too large for a fix,
> so I'll go ahead and push one of the previous fixes restricting batching
> to plain INSERT commands. But this seems useful, so I suggest adding it
> to the next commitfest.

Okay, I will post the rebased patch to a new thread.

> One thing that surprised me is that we only move the rows *to* the
> foreign partition, not from it (even on pg13, i.e. before the batching
> etc.). I mean, using the example you posted earlier, with one foreign
> and one local partition, consider this:
>
>   delete from p;
>   insert into p values (2);
>
>   test=# select * from p2;
>    a
>   ---
>    2
>   (1 row)
>
>   test=# update p set a = 1;
>   UPDATE 1
>
>   test=# select * from p1;
>    a
>   ---
>    1
>   (1 row)
>
> OK, so it was moved to the foreign partition, which is for rows with
> value in (1). So far so good. Let's do another update:
>
>   test=# update p set a = 2;
>   UPDATE 1
>   test=# select * from p1;
>    a
>   ---
>    2
>   (1 row)
>
> So now the foreign partition contains value (2), which is however wrong
> with respect to the partitioning rules - this should be in p2, the local
> partition.
>
> This however causes pretty annoying issue:
>
> test=# explain analyze select * from p where a = 2;
>
>                             QUERY PLAN
>   ---------------------------------------------------------------
>    Seq Scan on p2 p  (cost=0.00..41.88 rows=13 width=4)
>                      (actual  time=0.024..0.028 rows=0 loops=1)
>      Filter: (a = 2)
>    Planning Time: 0.355 ms
>    Execution Time: 0.089 ms
>   (4 rows)
>
> That is, we fail to find the row, because we eliminate the partition.
>
> Now, maybe this is expected, but it seems like a rather mind-boggling
> violation of POLA principle.

Yeah, I think we knowingly allow this behavior.  The documentation
states that a foreign table's constraints are not enforced by the core
server nor by the FDW, but I suppose we still allow declaring them
mostly for the planner's consumption:

https://www.postgresql.org/docs/current/sql-createforeigntable.html

"Constraints on foreign tables (such as CHECK or NOT NULL clauses) are
not enforced by the core PostgreSQL system, and most foreign data
wrappers do not attempt to enforce them either; that is, the
constraint is simply assumed to hold true. There would be little point
in such enforcement since it would only apply to rows inserted or
updated via the foreign table, and not to rows modified by other
means, such as directly on the remote server. Instead, a constraint
attached to a foreign table should represent a constraint that is
being enforced by the remote server."

Partitioning constraints are not treated any differently for those
reasons.  It's a good idea to declare a CHECK constraint on the remote
table that matches the local partition constraint, though IIRC we
don't mention that advice anywhere in our documentation.

> I've checked if postgres_fdw mentions this
> somewhere, but all I see about row movement is this:
>
>    Note also that postgres_fdw supports row movement invoked by UPDATE
>    statements executed on partitioned tables, but it currently does not
>    handle the case where a remote partition chosen to insert a moved row
>    into is also an UPDATE target partition that will be updated later.
>
> and if I understand that correctly, that's about something different.

Yeah, that's a note saying that while we do support moving a row from
a local partition to a postgres_fdw foreign partition, it's only
allowed if the foreign partition won't subsequently be updated.  So to
reiterate, the cases we don't support:

* Moving a row from a foreign partition to a local one

* Moving a row from a local partition to a foreign one if the latter
will be updated subsequent to moving a row into it

postgres_fdw detects the second case with the following code in
postgresBeginForeignInsert():

    /*
     * If the foreign table we are about to insert routed rows into is also an
     * UPDATE subplan result rel that will be updated later, proceeding with
     * the INSERT will result in the later UPDATE incorrectly modifying those
     * routed rows, so prevent the INSERT --- it would be nice if we could
     * handle this case; but for now, throw an error for safety.
     */
    if (plan && plan->operation == CMD_UPDATE &&
        (resultRelInfo->ri_usesFdwDirectModify ||
         resultRelInfo->ri_FdwState) &&
        resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot route tuples into foreign table to be
updated \"%s\"",
                        RelationGetRelationName(rel))));
--
Amit Langote
EDB: http://www.enterprisedb.com