Thread: New Table Access Methods for Multi and Single Inserts

New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

08 December 2020, 12:57:37

Hi,

Currently, for any component (such as COPY, CTAS[1], CREATE/REFRESH
Mat View[1], INSERT INTO SELECTs[2]) multi insert logic such as buffer
slots allocation, maintenance, decision to flush and clean up, need to
be implemented outside the table_multi_insert() API. The main problem
is that it fails to take into consideration the underlying storage
engine capabilities, for more details of this point refer to a
discussion in multi inserts in CTAS thread[1]. This also creates a lot
of duplicate code which is more error prone and not maintainable.

More importantly, in another thread [3] @Andres Freund suggested to
have table insert APIs in such a way that they look more like 'scan'
APIs i.e. insert_begin, insert, insert_end. The main advantages doing
this are(quoting from his statement in [3]) - "more importantly it'd
allow an AM to optimize operations across multiple inserts, which is
important for column stores."

I propose to introduce new table access methods for both multi and
single inserts based on the prototype suggested by Andres in [3]. Main
design goal of these new APIs is to give flexibility to tableam
developers in implementing multi insert logic dependent on the
underlying storage engine.

Below are the APIs. I suggest to have a look at
v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch for details
of the new data structure and the API functionality. Note that
temporarily I used XX_v2, we can change it later.

TableInsertState* table_insert_begin(initial_args);
void table_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_flush(TableInsertState *state);
void table_insert_end(TableInsertState *state);

I'm attaching a few patches(just to show that these APIs work, avoids
a lot of duplicate code and makes life easier). Better commenting can
be added later. If these APIs and patches look okay, we can even
consider replacing them in other places such as nodeModifyTable.c and
so on.

v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch --->
introduces new table access methods for multi and single inserts. Also
implements/rearranges the outside code for heap am into these new
APIs.
v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch
---> adds new multi insert table access methods to CREATE TABLE AS,
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.
v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch ---> adds
new single insert table access method to ALTER TABLE rewrite table
code.
v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch ---> adds
new single and multi insert table access method to COPY code.

Thoughts?

Many thanks to Robert, Vignesh and Dilip for offlist discussion.

[1] - https://www.postgresql.org/message-id/4eee0730-f6ec-e72d-3477-561643f4b327%40swarm64.com
[2] - https://www.postgresql.org/message-id/20201124020020.GK24052%40telsasoft.com
[3] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

11 December 2020, 13:47:05

On Tue, Dec 8, 2020 at 6:27 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> Hi,
>
> Currently, for any component (such as COPY, CTAS[1], CREATE/REFRESH
> Mat View[1], INSERT INTO SELECTs[2]) multi insert logic such as buffer
> slots allocation, maintenance, decision to flush and clean up, need to
> be implemented outside the table_multi_insert() API. The main problem
> is that it fails to take into consideration the underlying storage
> engine capabilities, for more details of this point refer to a
> discussion in multi inserts in CTAS thread[1]. This also creates a lot
> of duplicate code which is more error prone and not maintainable.
>
> More importantly, in another thread [3] @Andres Freund suggested to
> have table insert APIs in such a way that they look more like 'scan'
> APIs i.e. insert_begin, insert, insert_end. The main advantages doing
> this are(quoting from his statement in [3]) - "more importantly it'd
> allow an AM to optimize operations across multiple inserts, which is
> important for column stores."
>
> I propose to introduce new table access methods for both multi and
> single inserts based on the prototype suggested by Andres in [3]. Main
> design goal of these new APIs is to give flexibility to tableam
> developers in implementing multi insert logic dependent on the
> underlying storage engine.
>
> Below are the APIs. I suggest to have a look at
> v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch for details
> of the new data structure and the API functionality. Note that
> temporarily I used XX_v2, we can change it later.
>
> TableInsertState* table_insert_begin(initial_args);
> void table_insert_v2(TableInsertState *state, TupleTableSlot *slot);
> void table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot);
> void table_multi_insert_flush(TableInsertState *state);
> void table_insert_end(TableInsertState *state);
>
> I'm attaching a few patches(just to show that these APIs work, avoids
> a lot of duplicate code and makes life easier). Better commenting can
> be added later. If these APIs and patches look okay, we can even
> consider replacing them in other places such as nodeModifyTable.c and
> so on.
>
> v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch --->
> introduces new table access methods for multi and single inserts. Also
> implements/rearranges the outside code for heap am into these new
> APIs.
> v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch
> ---> adds new multi insert table access methods to CREATE TABLE AS,
> CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.
> v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch ---> adds
> new single insert table access method to ALTER TABLE rewrite table
> code.
> v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch ---> adds
> new single and multi insert table access method to COPY code.
>
> Thoughts?
>
> [1] - https://www.postgresql.org/message-id/4eee0730-f6ec-e72d-3477-561643f4b327%40swarm64.com
> [2] - https://www.postgresql.org/message-id/20201124020020.GK24052%40telsasoft.com
> [3] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

Added this to commitfest to get it reviewed further.

https://commitfest.postgresql.org/31/2871/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Justin Pryzby

Date:

17 December 2020, 05:05:22

Typos:

+ *  1) Specify is_multi as true, then multi insert state is allcoated.
=> allocated
+ * dropped, short-lived memory context is delted and mistate is freed up.
=> deleted
+ * 2) Currently, GetTupleSize() handles the existing heap, buffer, minmal and
=> minimal
+       /* Mulit insert state if requested, otherwise NULL. */
=> multi
+ * Buffer the input slots and insert the tuples from the buffered slots at a
=> *one* at a time ?
+ * Compute the size of the tuple only if mi_max_size i.e. the total tuple size
=> I guess you mean max_size

This variable could use a better name:
+CopyMulitInsertFlushBuffers(List **mirri, ..
mirri is fine for a local variable like an element of a struture/array, or a
loop variable, but not for a function parameter which is an "List" of arbitrary
pointers.

I think this comment needs to be updated (again) for the removal of the Info
structure.
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
+ * multi insert buffer items stored in CopyMultiInsertInfo's

I think the COPY patch should be 0002 (or maybe merged into 0001).
There's some superfluous whitespace (and other) changes there which make the
patch unnecessarily long.

You made the v2 insert interface a requirement for all table AMs.
Should it be optional, and fall back to simple inserts if not implemented ? 

For CTAS, I think we need to consider Paul's idea here.
https://www.postgresql.org/message-id/26C14A63-CCE5-4B46-975A-57C1784B3690%40vmware.com
Conceivably, tableam should support something like that for arbitrary AMs
("insert into a new table for which we have exclusive lock").  I think that AM
method should also be optional.  It should be possible to implement a minimal
AM without implementing every available optimization, which may not apply to
all AMs, anyway.

-- 
Justin

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

17 December 2020, 11:05:33

Thanks a lot for taking a look at the patches.

On Thu, Dec 17, 2020 at 10:35 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Typos:
>
> + *  1) Specify is_multi as true, then multi insert state is allcoated.
> => allocated
> + * dropped, short-lived memory context is delted and mistate is freed up.
> => deleted
> + * 2) Currently, GetTupleSize() handles the existing heap, buffer, minmal and
> => minimal
> +       /* Mulit insert state if requested, otherwise NULL. */
> => multi
> + * Buffer the input slots and insert the tuples from the buffered slots at a
> => *one* at a time ?
> + * Compute the size of the tuple only if mi_max_size i.e. the total tuple size
> => I guess you mean max_size
>
> This variable could use a better name:
> +CopyMulitInsertFlushBuffers(List **mirri, ..
> mirri is fine for a local variable like an element of a struture/array, or a
> loop variable, but not for a function parameter which is an "List" of arbitrary
> pointers.
>
> I think this comment needs to be updated (again) for the removal of the Info
> structure.
> - * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
> + * multi insert buffer items stored in CopyMultiInsertInfo's
>
> There's some superfluous whitespace (and other) changes there which make the
> patch unnecessarily long.

I will correct them and post the next version of the patch set. Before
that, I would like to have the discussion and thoughts on the APIs and
their usefulness.

> I think the COPY patch should be 0002 (or maybe merged into 0001).

I can make it as a 0002 patch.

> You made the v2 insert interface a requirement for all table AMs.
> Should it be optional, and fall back to simple inserts if not implemented ?

I tried to implement the APIs mentioned by Andreas here in [1]. I just
used v2 table am APIs in existing table_insert places to show that it
works. Having said that, if you notice, I moved the bulk insert
allocation and deallocation to the new APIs table_insert_begin() and
table_insert_end() respectively, which make them tableam specific.
Currently, the bulk insert state is outside and independent of
tableam. I think we should not make bulk insert state allocation and
deallocation tableam specific. Thoughts?

[1] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

> For CTAS, I think we need to consider Paul's idea here.
> https://www.postgresql.org/message-id/26C14A63-CCE5-4B46-975A-57C1784B3690%40vmware.com

IMO, if we were to allow those raw insert APIs to perform parallel
inserts, then we would be reimplementing the existing table_insert or
table_mulit_insert API by having some sort of shared memory for
coordinating among workers and so on, may be in some other way. Yes,
we could avoid all the existing locking and shared buffers with those
raw insert APIs, I also feel that we can now do that with the existing
insert APIs for unlogged tables and bulk insert state. To me, the raw
insert APIs after implementing them for the parallel inserts, they
would look like the existing insert APIs for unlogged tables and with
bulk insert state. Thoughts?

Please have a look at [1] for detailed comment.

[1] https://www.postgresql.org/message-id/CALj2ACX0u%3DQvB7GHLEqeVYwvs2eQS7%3D-cEuem7ZaF%3Dp%2BqZ0ikA%40mail.gmail.com

> Conceivably, tableam should support something like that for arbitrary AMs
> ("insert into a new table for which we have exclusive lock").  I think that AM
> method should also be optional.  It should be possible to implement a minimal
> AM without implementing every available optimization, which may not apply to
> all AMs, anyway.

I could not understand this point well. Maybe more thoughts help me here.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Justin Pryzby

Date:

17 December 2020, 20:44:42

On Thu, Dec 17, 2020 at 04:35:33PM +0530, Bharath Rupireddy wrote:
> > You made the v2 insert interface a requirement for all table AMs.
> > Should it be optional, and fall back to simple inserts if not implemented ?
> 
> I tried to implement the APIs mentioned by Andreas here in [1]. I just
> used v2 table am APIs in existing table_insert places to show that it
> works. Having said that, if you notice, I moved the bulk insert
> allocation and deallocation to the new APIs table_insert_begin() and
> table_insert_end() respectively, which make them tableam specific.

I mean I think it should be optional for a tableam to support the optimized
insert routines.  Here, you've made it a requirement.

+       Assert(routine->tuple_insert_begin != NULL);
+       Assert(routine->tuple_insert_v2 != NULL);
+       Assert(routine->multi_insert_v2 != NULL);
+       Assert(routine->multi_insert_flush != NULL);
+       Assert(routine->tuple_insert_end != NULL);

+static inline void
+table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
+{
+       state->rel->rd_tableam->multi_insert_v2(state, slot);
+}

If multi_insert_v2 == NULL, I think table_multi_insert_v2() would just call
table_insert_v2(), and begin/flush/end would do nothing.  If
table_multi_insert_v2!=NULL, then you should assert that the other routines are
provided.

Are you thinking that TableInsertState would eventually have additional
attributes which would apply to other tableams, but not to heap ?  Is
heap_insert_begin() really specific to heap ?  It's allocating and populating a
structure based on its arguments, but those same arguments would be passed to
every other AM's insert_begin routine, too.  Do you need a more flexible data
structure, something that would also accomodate extensions?  I'm thinking of
reloptions as a loose analogy.

-- 
Justin

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

18 December 2020, 02:09:14

On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Thu, Dec 17, 2020 at 04:35:33PM +0530, Bharath Rupireddy wrote:
> > > You made the v2 insert interface a requirement for all table AMs.
> > > Should it be optional, and fall back to simple inserts if not implemented ?
> >
> > I tried to implement the APIs mentioned by Andreas here in [1]. I just
> > used v2 table am APIs in existing table_insert places to show that it
> > works. Having said that, if you notice, I moved the bulk insert
> > allocation and deallocation to the new APIs table_insert_begin() and
> > table_insert_end() respectively, which make them tableam specific.
>
> I mean I think it should be optional for a tableam to support the optimized
> insert routines.  Here, you've made it a requirement.
>
> +       Assert(routine->tuple_insert_begin != NULL);
> +       Assert(routine->tuple_insert_v2 != NULL);
> +       Assert(routine->multi_insert_v2 != NULL);
> +       Assert(routine->multi_insert_flush != NULL);
> +       Assert(routine->tuple_insert_end != NULL);

+1 to make them optional. I will change.

> +static inline void
> +table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
> +{
> +       state->rel->rd_tableam->multi_insert_v2(state, slot);
> +}
>
> If multi_insert_v2 == NULL, I think table_multi_insert_v2() would just call
> table_insert_v2(), and begin/flush/end would do nothing.  If
> table_multi_insert_v2!=NULL, then you should assert that the other routines are
> provided.

What should happen if both multi_insert_v2 and insert_v2 are NULL?
Should we error out from table_insert_v2()?

> Are you thinking that TableInsertState would eventually have additional
> attributes which would apply to other tableams, but not to heap ?  Is
> heap_insert_begin() really specific to heap ?  It's allocating and populating a
> structure based on its arguments, but those same arguments would be passed to
> every other AM's insert_begin routine, too.  Do you need a more flexible data
> structure, something that would also accomodate extensions?  I'm thinking of
> reloptions as a loose analogy.

I could not think of other tableam attributes now. But +1 to have that
kind of flexible structure for TableInsertState. So, it can have
tableam type and attributes within the union for each type.

> I moved the bulk insert allocation and deallocation to the new APIs table_insert_begin()
> and table_insert_end() respectively, which make them tableam specific.
> Currently, the bulk insert state is outside and independent of
> tableam. I think we should not make bulk insert state allocation and
> deallocation tableam specific.

Any thoughts on the above point?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Justin Pryzby

Date:

18 December 2020, 17:54:39

On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Are you thinking that TableInsertState would eventually have additional
> > attributes which would apply to other tableams, but not to heap ?  Is
> > heap_insert_begin() really specific to heap ?  It's allocating and populating a
> > structure based on its arguments, but those same arguments would be passed to
> > every other AM's insert_begin routine, too.  Do you need a more flexible data
> > structure, something that would also accomodate extensions?  I'm thinking of
> > reloptions as a loose analogy.
> 
> I could not think of other tableam attributes now. But +1 to have that
> kind of flexible structure for TableInsertState. So, it can have
> tableam type and attributes within the union for each type.

Right now you have heap_insert_begin(), and I asked if it was really
heap-specific.  Right now, it populates a struct based on a static list of
arguments, which are what heap uses.  

If you were to implement a burp_insert_begin(), how would it differ from
heap's?  With the current API, they'd (have to) be the same, which means either
that it should apply to all AMs (or have a "default" implementation that can be
overridden by an AM), or that this API assumes that other AMs will want to do
exactly what heap does, and fails to allow other AMs to implement optimizations
for bulk inserts as claimed.

I don't think using a "union" solves the problem, since it can only accommodate
core AMs, and not extensions, so I suggested something like reloptions, which
have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET
toast.autovacuum_enabled).

-- 
Justin

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

21 December 2020, 07:42:07

On Fri, Dec 18, 2020 at 11:24 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > Are you thinking that TableInsertState would eventually have additional
> > > attributes which would apply to other tableams, but not to heap ?  Is
> > > heap_insert_begin() really specific to heap ?  It's allocating and populating a
> > > structure based on its arguments, but those same arguments would be passed to
> > > every other AM's insert_begin routine, too.  Do you need a more flexible data
> > > structure, something that would also accomodate extensions?  I'm thinking of
> > > reloptions as a loose analogy.
> >
> > I could not think of other tableam attributes now. But +1 to have that
> > kind of flexible structure for TableInsertState. So, it can have
> > tableam type and attributes within the union for each type.
>
> Right now you have heap_insert_begin(), and I asked if it was really
> heap-specific.  Right now, it populates a struct based on a static list of
> arguments, which are what heap uses.
>
> If you were to implement a burp_insert_begin(), how would it differ from
> heap's?  With the current API, they'd (have to) be the same, which means either
> that it should apply to all AMs (or have a "default" implementation that can be
> overridden by an AM), or that this API assumes that other AMs will want to do
> exactly what heap does, and fails to allow other AMs to implement optimizations
> for bulk inserts as claimed.
>
> I don't think using a "union" solves the problem, since it can only accommodate
> core AMs, and not extensions, so I suggested something like reloptions, which
> have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET
> toast.autovacuum_enabled).

IIUC, your suggestion is to make the heap options such as
alloc_bistate(bulk insert state is required or not), mi_max_slots
(number of maximum buffered slots/tuples) and mi_max_size (the maximum
tuple size of the buffered slots) as reloptions with some default
values in reloptions.c under RELOPT_KIND_HEAP category so that they
can be modified by users on a per table basis. And likewise other
tableam options can be added by the tableam developers. This way, the
APIs will become more generic. The tableam developers need to add
reloptions of their choice and use them in the new API
implementations.

Let me know if I am missing anything from what you have in your mind.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Justin Pryzby

Date:

21 December 2020, 07:47:25

On Fri, Dec 18, 2020 at 11:54:39AM -0600, Justin Pryzby wrote:
> On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > Are you thinking that TableInsertState would eventually have additional
> > > attributes which would apply to other tableams, but not to heap ?  Is
> > > heap_insert_begin() really specific to heap ?  It's allocating and populating a
> > > structure based on its arguments, but those same arguments would be passed to
> > > every other AM's insert_begin routine, too.  Do you need a more flexible data
> > > structure, something that would also accomodate extensions?  I'm thinking of
> > > reloptions as a loose analogy.
> > 
> > I could not think of other tableam attributes now. But +1 to have that
> > kind of flexible structure for TableInsertState. So, it can have
> > tableam type and attributes within the union for each type.
> 
> Right now you have heap_insert_begin(), and I asked if it was really
> heap-specific.  Right now, it populates a struct based on a static list of
> arguments, which are what heap uses.  
> 
> If you were to implement a burp_insert_begin(), how would it differ from
> heap's?  With the current API, they'd (have to) be the same, which means either
> that it should apply to all AMs (or have a "default" implementation that can be
> overridden by an AM), or that this API assumes that other AMs will want to do
> exactly what heap does, and fails to allow other AMs to implement optimizations
> for bulk inserts as claimed.
> 
> I don't think using a "union" solves the problem, since it can only accommodate
> core AMs, and not extensions, so I suggested something like reloptions, which
> have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET
> toast.autovacuum_enabled).

I think you'd want to handle things like:

 - a compressed AM wants to specify a threshold for a tuple's *compressed* size
   (maybe in addition to the uncompressed size);
 - a "columnar" AM wants to specify a threshold size for a column, rather
   than for each tuple;

I'm not proposing to handle those specific parameters, but rather pointing out
that your implementation doesn't allow handling AM-specific considerations,
which I think was the goal.

The TableInsertState structure would need to store those, and then the AM's
multi_insert_v2 routine would need to make use of them.

It feels a bit like we'd introduce the idea of an "AM option", except that it
wouldn't be user-facing (or maybe some of them would be?).  Maybe I've
misunderstood though, so other opinions are welcome.

-- 
Justin

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

24 December 2020, 00:18:42

On Mon, Dec 21, 2020 at 1:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Dec 18, 2020 at 11:54:39AM -0600, Justin Pryzby wrote:
> > On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote:
> > > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > > Are you thinking that TableInsertState would eventually have additional
> > > > attributes which would apply to other tableams, but not to heap ?  Is
> > > > heap_insert_begin() really specific to heap ?  It's allocating and populating a
> > > > structure based on its arguments, but those same arguments would be passed to
> > > > every other AM's insert_begin routine, too.  Do you need a more flexible data
> > > > structure, something that would also accomodate extensions?  I'm thinking of
> > > > reloptions as a loose analogy.
> > >
> > > I could not think of other tableam attributes now. But +1 to have that
> > > kind of flexible structure for TableInsertState. So, it can have
> > > tableam type and attributes within the union for each type.
> >
> > Right now you have heap_insert_begin(), and I asked if it was really
> > heap-specific.  Right now, it populates a struct based on a static list of
> > arguments, which are what heap uses.
> >
> > If you were to implement a burp_insert_begin(), how would it differ from
> > heap's?  With the current API, they'd (have to) be the same, which means either
> > that it should apply to all AMs (or have a "default" implementation that can be
> > overridden by an AM), or that this API assumes that other AMs will want to do
> > exactly what heap does, and fails to allow other AMs to implement optimizations
> > for bulk inserts as claimed.
> >
> > I don't think using a "union" solves the problem, since it can only accommodate
> > core AMs, and not extensions, so I suggested something like reloptions, which
> > have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET
> > toast.autovacuum_enabled).
>
> I think you'd want to handle things like:
>
>  - a compressed AM wants to specify a threshold for a tuple's *compressed* size
>    (maybe in addition to the uncompressed size);
>  - a "columnar" AM wants to specify a threshold size for a column, rather
>    than for each tuple;
>
> I'm not proposing to handle those specific parameters, but rather pointing out
> that your implementation doesn't allow handling AM-specific considerations,
> which I think was the goal.
>
> The TableInsertState structure would need to store those, and then the AM's
> multi_insert_v2 routine would need to make use of them.
>
> It feels a bit like we'd introduce the idea of an "AM option", except that it
> wouldn't be user-facing (or maybe some of them would be?).  Maybe I've
> misunderstood though, so other opinions are welcome.

Attaching a v2 patch for the new table AMs.

This patch has following changes:

1) Made the TableInsertState structure generic by having a void
pointer for multi insert state and defined the heap specific multi
insert state information in heapam.h. This way each AM can have it's
own multi insert state structure and dereference the void pointer
using that structure inside the respective AM implementations.
2) Earlier in the v1 patch, the bulk insert state
allocation/deallocation was moved to AM level, but I see that there's
nothing specific in doing so and I think it should be independent of
AM. So I'm doing that in table_insert_begin() and table_insert_end().
Because of this, I had to move the BulkInsert function declarations
from heapam.h to tableam.h
3) Corrected the typos and tried to adjust indentation of the code.

Note that I have not yet made the multi_insert_v2 API optional as
suggested earlier. I will think more on this and update.

I'm not posting the updated 0002 to 0004 patches, I plan to do so
after a couple of reviews happen on the design of the APIs in 0001.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

v2-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch

Re: New Table Access Methods for Multi and Single Inserts

From

Justin Pryzby

Date:

25 December 2020, 02:40:14

On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
> I'm not posting the updated 0002 to 0004 patches, I plan to do so
> after a couple of reviews happen on the design of the APIs in 0001.
> 
> Thoughts?

Are you familiar with this work ?

https://commitfest.postgresql.org/31/2717/
Reloptions for table access methods

It seems like that can be relevant for your patch, and I think some of what
your patch needs might be provided by AM opts.  

It's difficult to generalize AMs when we have only one, but your use-case might
be a concrete example which would help to answer some questions on the other
thread.

@Jeff: https://commitfest.postgresql.org/31/2871/

-- 
Justin

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

28 December 2020, 12:48:08

On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
> > I'm not posting the updated 0002 to 0004 patches, I plan to do so
> > after a couple of reviews happen on the design of the APIs in 0001.
> >
> > Thoughts?
>
> Are you familiar with this work ?
>
> https://commitfest.postgresql.org/31/2717/
> Reloptions for table access methods
>
> It seems like that can be relevant for your patch, and I think some of what
> your patch needs might be provided by AM opts.
>
> It's difficult to generalize AMs when we have only one, but your use-case might
> be a concrete example which would help to answer some questions on the other
> thread.
>
> @Jeff: https://commitfest.postgresql.org/31/2871/

Note that I have not gone through the entire thread at [1]. On some
initial study, that patch is proposing to allow different table AMs to
have custom rel options.

In the v2 patch that I sent upthread [2] for new table AMs has heap AM
multi insert code moved inside the new heap AM implementation and I
don't see any need of having rel options. In case, any other AMs want
to have the control for their multi insert API implementation via rel
options, I think the proposal at [1] can be useful.

IIUC, there's no dependency or anything as such for the new table AM
patch with the rel options thread [1]. If I'm right, can this new
table AM patch [2] be reviewed further?

Thoughts?

[1] - https://commitfest.postgresql.org/31/2717/
[2] -
https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Luc Vlaming

Date:

04 January 2021, 07:59:01

On 28-12-2020 13:48, Bharath Rupireddy wrote:
> On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
>>> I'm not posting the updated 0002 to 0004 patches, I plan to do so
>>> after a couple of reviews happen on the design of the APIs in 0001.
>>>
>>> Thoughts?
>>
>> Are you familiar with this work ?
>>
>> https://commitfest.postgresql.org/31/2717/
>> Reloptions for table access methods
>>
>> It seems like that can be relevant for your patch, and I think some of what
>> your patch needs might be provided by AM opts.
>>
>> It's difficult to generalize AMs when we have only one, but your use-case might
>> be a concrete example which would help to answer some questions on the other
>> thread.
>>
>> @Jeff: https://commitfest.postgresql.org/31/2871/
> 
> Note that I have not gone through the entire thread at [1]. On some
> initial study, that patch is proposing to allow different table AMs to
> have custom rel options.
> 
> In the v2 patch that I sent upthread [2] for new table AMs has heap AM
> multi insert code moved inside the new heap AM implementation and I
> don't see any need of having rel options. In case, any other AMs want
> to have the control for their multi insert API implementation via rel
> options, I think the proposal at [1] can be useful.
> 
> 
> Thoughts?
> 
> [1] - https://commitfest.postgresql.org/31/2717/
> [2] -
https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
> 
Hi,

 > IIUC, there's no dependency or anything as such for the new table AM
 > patch with the rel options thread [1]. If I'm right, can this new
 > table AM patch [2] be reviewed further?

To me this seems good enough. Reason is that I anticipate that there 
would not necessarily be per-table options for now but rather global 
options, if any. Moreover, if we want to make these kind of tradeoffs 
user-controllable I would argue this should be done in a different 
patch-set either way. Reason is that there are parameters in heap 
already that are computed / hardcoded as well (see e.g. 
RelationAddExtraBlocks).

===

As to the patches themselves:

I think the API is a huge step forward! I assume that we want to have a 
single-insert API like heap_insert_v2 so that we can encode the 
knowledge that there will just be a single insert coming and likely a 
commit afterwards?

Reason I'm asking is that I quite liked the heap_insert_begin parameter 
is_multi, which could even be turned into a "expected_rowcount" of the 
amount of rows expected to be commited in the transaction (e.g. single, 
several, thousands/stream).
If we were to make the API based on expected rowcounts, the whole 
heap_insert_v2, heap_insert and heap_multi_insert could be turned into a 
single function heap_insert, as the knowledge about buffering of the 
slots is then already stored in the TableInsertState, creating an API like:

// expectedRows: -1 = streaming, otherwise expected rowcount.
TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int 
options, int expectedRows);
heap_insert(TableInsertState *state, TupleTableSlot *slot);

Do you think that's a good idea?

Two smaller things I'm wondering:
- the clear_mi_slots; why is this not in the HeapMultiInsertState? the 
slots themselves are declared there? also, the boolean themselves is 
somewhat problematic I think because it would only work if you specified 
is_multi=true which would depend on the actual tableam to implement this 
then in a way that copy/ctas/etc can also use the slot properly, which I 
think would severely limit their freedom to store the slots more 
efficiently? Also, why do we want to do ExecClearTuple() anyway? Isn't 
it good enough that the next call to ExecCopySlot will effectively clear 
it out?
- flushed -> why is this a stored boolean? isn't this indirectly encoded 
by cur_slots/cur_size == 0?

For patches 02-04 I quickly skimmed through them as I assume we first 
want the API agreed upon. Generally they look nice and like a big step 
forward. What I'm just wondering about is the usage of the 
implementation details like mistate->slots[X]. It makes a lot of sense 
to do so but also makes for a difficult compromise, because now the 
tableam has to guarantee a copy of the slot, and hopefully even one in a 
somewhat efficient form.

Kind regards,
Luc

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

05 January 2021, 10:06:24

On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc@swarm64.com> wrote:
> > table AM patch [2] be reviewed further?
> As to the patches themselves:
>
> I think the API is a huge step forward! I assume that we want to have a
> single-insert API like heap_insert_v2 so that we can encode the
> knowledge that there will just be a single insert coming and likely a
> commit afterwards?
>
> Reason I'm asking is that I quite liked the heap_insert_begin parameter
> is_multi, which could even be turned into a "expected_rowcount" of the
> amount of rows expected to be commited in the transaction (e.g. single,
> several, thousands/stream).
> If we were to make the API based on expected rowcounts, the whole
> heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
> single function heap_insert, as the knowledge about buffering of the
> slots is then already stored in the TableInsertState, creating an API like:
>
> // expectedRows: -1 = streaming, otherwise expected rowcount.
> TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
> options, int expectedRows);
> heap_insert(TableInsertState *state, TupleTableSlot *slot);
>
> Do you think that's a good idea?

IIUC, your suggestion is to use expectedRows and move the multi insert implementation heap_multi_insert_v2 to heap_insert_v2. If that's correct, so heap_insert_v2 will look something like this:

heap_insert_v2()
{
if (single_insert)
//do single insertion work, the code in existing heap_insert_v2 comes here
else
//do multi insertion work, the code in existing heap_multi_insert_v2 comes here
}

I don't see any problem in combining single and multi insert APIs into one. Having said that, will the APIs be cleaner then? Isn't it going to be confusing if a single heap_insert_v2 API does both the works? With the existing separate APIs, for single insertion, the sequence of the API can be like begin, insert_v2, end and for multi inserts it's like begin, multi_insert_v2, flush, end. I prefer to have a separate multi insert API so that it will make the code look readable.

Thoughts?

> Two smaller things I'm wondering:
> - the clear_mi_slots; why is this not in the HeapMultiInsertState? the
> slots themselves are declared there?

Firstly, we need to have the buffered slots sometimes(please have a look at the comments in TableInsertState structure) outside the multi_insert API. And we need to have cleared the previously flushed slots before we start buffering in heap_multi_insert_v2(). I can remove the clear_mi_slots flag altogether and do as follows: I will not set mistate->cur_slots to 0 in heap_multi_insert_flush after the flush, I will only set state->flushed to true. In heap_multi_insert_v2,

void
heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
TupleTableSlot *batchslot;
HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
Size sz;

Assert(mistate && mistate->slots);

/* if the slots are flushed previously then clear them off before using them again. */
if (state->flushed)
{
int i;

for (i = 0; i < mistate->cur_slots; i++)
ExecClearTuple(mistate->slots[i]);

mistate->cur_slots = 0;
state->flushed = false
}

if (mistate->slots[mistate->cur_slots] == NULL)
mistate->slots[mistate->cur_slots] =
table_slot_create(state->rel, NULL);

batchslot = mistate->slots[mistate->cur_slots];

ExecCopySlot(batchslot, slot);

Thoughts?

> Also, why do we want to do ExecClearTuple() anyway? Isn't
> it good enough that the next call to ExecCopySlot will effectively clear
> it out?

For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the slot before copying. But, for buffer heap slots, the tts_buffer_heap_copyslot does not always clear the destination slot, see below. If we fall into else condition, we might get some issues. And also note that, once the slot is cleared in ExecClearTuple, it will not be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be false. That is why, let's have ExecClearTuple as is.

/*
* If the source slot is of a different kind, or is a buffer slot that has
* been materialized / is virtual, make a new copy of the tuple. Otherwise
* make a new reference to the in-buffer tuple.
*/
if (dstslot->tts_ops != srcslot->tts_ops ||
TTS_SHOULDFREE(srcslot) ||
!bsrcslot->base.tuple)
{
MemoryContext oldContext;

ExecClearTuple(dstslot);
}
else
{
Assert(BufferIsValid(bsrcslot->buffer));

tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
bsrcslot->buffer, false);

> - flushed -> why is this a stored boolean? isn't this indirectly encoded
> by cur_slots/cur_size == 0?

Note that cur_slots is in HeapMultiInsertState and outside of the new APIs i.e. in TableInsertState, mistate is a void pointer, and we can't really access the cur_slots. I mean, we can access but we need to be dereferencing using the tableam kind. Instead of doing all of that, to keep the API cleaner, I chose to have a boolean in the TableInsertState which we can see and use outside of the new APIs. Hope that's fine.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

05 January 2021, 21:28:59

On Mon, 2021-01-04 at 08:59 +0100, Luc Vlaming wrote:
> Reason I'm asking is that I quite liked the heap_insert_begin
> parameter 
> is_multi, which could even be turned into a "expected_rowcount" of
> the 
> amount of rows expected to be commited in the transaction (e.g.
> single, 
> several, thousands/stream).

Do you mean "written by the statement" instead of "committed in the
transaction"? It doesn't look like the TableInsertState state will
survive across statement boundaries.

Though that is an important question to consider. If the premise is
that a given custom AM may be much more efficient at bulk inserts than
retail inserts (which is reasonable), then it makes sense to handle the
case of a transaction with many single-tuple inserts. But keeping
insert state across statement boundaries also raises a few potential
problems.

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Luc Vlaming

Date:

06 January 2021, 07:00:55

On 05-01-2021 22:28, Jeff Davis wrote:
> On Mon, 2021-01-04 at 08:59 +0100, Luc Vlaming wrote:
>> Reason I'm asking is that I quite liked the heap_insert_begin
>> parameter
>> is_multi, which could even be turned into a "expected_rowcount" of
>> the
>> amount of rows expected to be commited in the transaction (e.g.
>> single,
>> several, thousands/stream).
> 
> Do you mean "written by the statement" instead of "committed in the
> transaction"? It doesn't look like the TableInsertState state will
> survive across statement boundaries.
> 
> Though that is an important question to consider. If the premise is
> that a given custom AM may be much more efficient at bulk inserts than
> retail inserts (which is reasonable), then it makes sense to handle the
> case of a transaction with many single-tuple inserts. But keeping
> insert state across statement boundaries also raises a few potential
> problems.
> 
> Regards,
>     Jeff Davis
> 
> 

I did actually mean until the end of the transaction. I know this is 
currently not possible with the current design but I think it would be 
cool to start going that way (even if slightly). Creating some more 
freedom on how a tableam optimizes inserts, when one syncs to disk, etc 
would be good imo. It would allow one to create e.g. a tableam that 
would not have as a high overhead when doing single statement inserts.

Kind regards,
Luc

Re: New Table Access Methods for Multi and Single Inserts

From

Luc Vlaming

Date:

06 January 2021, 07:26:38

On 05-01-2021 11:06, Bharath Rupireddy wrote:
> On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc@swarm64.com 
> <mailto:luc@swarm64.com>> wrote:
>  >  > table AM patch [2] be reviewed further?
>  > As to the patches themselves:
>  >
>  > I think the API is a huge step forward! I assume that we want to have a
>  > single-insert API like heap_insert_v2 so that we can encode the
>  > knowledge that there will just be a single insert coming and likely a
>  > commit afterwards?
>  >
>  > Reason I'm asking is that I quite liked the heap_insert_begin parameter
>  > is_multi, which could even be turned into a "expected_rowcount" of the
>  > amount of rows expected to be commited in the transaction (e.g. single,
>  > several, thousands/stream).
>  > If we were to make the API based on expected rowcounts, the whole
>  > heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
>  > single function heap_insert, as the knowledge about buffering of the
>  > slots is then already stored in the TableInsertState, creating an API 
> like:
>  >
>  > // expectedRows: -1 = streaming, otherwise expected rowcount.
>  > TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
>  > options, int expectedRows);
>  > heap_insert(TableInsertState *state, TupleTableSlot *slot);
>  >
>  > Do you think that's a good idea?
> 
> IIUC, your suggestion is to use expectedRows and move the multi insert 
> implementation heap_multi_insert_v2 to heap_insert_v2. If that's 
> correct, so heap_insert_v2 will look something like this:
> 
> heap_insert_v2()
> {
>      if (single_insert)
>        //do single insertion work, the code in existing heap_insert_v2 
> comes here
>     else
>        //do multi insertion work, the code in existing 
> heap_multi_insert_v2 comes here
> }
> 
> I don't see any problem in combining single and multi insert APIs into 
> one. Having said that, will the APIs be cleaner then? Isn't it going to 
> be confusing if a single heap_insert_v2 API does both the works? With 
> the existing separate APIs, for single insertion, the sequence of the 
> API can be like begin, insert_v2, end and for multi inserts it's like 
> begin, multi_insert_v2, flush, end. I prefer to have a separate multi 
> insert API so that it will make the code look readable.
> 
> Thoughts?

The main reason for me for wanting a single API is that I would like the 
decision of using single or multi inserts to move to inside the tableam.
For e.g. a heap insert we might want to put the threshold at e.g. 100 
rows so that the overhead of buffering the tuples is actually 
compensated. For other tableam this logic might also be quite different, 
and I think therefore that it shouldn't be e.g. COPY or CTAS deciding 
whether or not multi inserts should be used. Because otherwise the thing 
we'll get is that there will be tableams that will ignore this flag and 
do their own thing anyway. I'd rather have an API that gives all 
necessary information to the tableam and then make the tableam do "the 
right thing".

Another reason I'm suggesting this API is that I would expect that the 
begin is called in a different place in the code for the (multiple) 
inserts than the actual insert statement.
To me conceptually the begin and end are like e.g. the executor begin 
and end: you prepare the inserts with the knowledge you have at that 
point. I assumed (wrongly?) that during the start of the statement one 
knows best how many rows are coming; and then the actual insertion of 
the row doesn't have to deal anymore with multi/single inserts, choosing 
when to buffer or not, because that information has already been given 
during the initial phase. One of the reasons this is appealing to me is 
that e.g. in [1] there was discussion on when to switch to a multi 
insert state, and imo this should be up to the tableam.

> 
>  > Two smaller things I'm wondering:
>  > - the clear_mi_slots; why is this not in the HeapMultiInsertState? the
>  > slots themselves are declared there?
> 
> Firstly, we need to have the buffered slots sometimes(please have a look 
> at the comments in TableInsertState structure) outside the multi_insert 
> API. And we need to have cleared the previously flushed slots before we 
> start buffering in heap_multi_insert_v2(). I can remove the 
> clear_mi_slots flag altogether and do as follows: I will not set 
> mistate->cur_slots to 0 in heap_multi_insert_flush after the flush, I 
> will only set state->flushed to true. In heap_multi_insert_v2,
> 
> void
> heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
> {
>      TupleTableSlot  *batchslot;
>      HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
>      Size sz;
> 
>      Assert(mistate && mistate->slots);
> 
> *  /* if the slots are flushed previously then clear them off before 
> using them again. */
>      if (state->flushed)
>      {
>          int i;
> 
>          for (i = 0; i < mistate->cur_slots; i++)
>              ExecClearTuple(mistate->slots[i]);
> 
>          mistate->cur_slots = 0;
>          state->flushed = false
>      }*
> 
>      if (mistate->slots[mistate->cur_slots] == NULL)
>          mistate->slots[mistate->cur_slots] =
>                                      table_slot_create(state->rel, NULL);
> 
>      batchslot = mistate->slots[mistate->cur_slots];
> 
>      ExecCopySlot(batchslot, slot);
> 
> Thoughts?

 From what I can see you can just keep the v2-0001 patch and:
- remove the flushed variable alltogether. mistate->cur_slots == 0 
encodes this already and the variable is never actually checked on.
- call ExecClearTuple just before ExecCopySlot()

Which would make the code something like:

void
heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
    TupleTableSlot  *batchslot;
    HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
    Size sz;

    Assert(mistate && mistate->slots);

    if (mistate->slots[mistate->cur_slots] == NULL)
        mistate->slots[mistate->cur_slots] =
                                    table_slot_create(state->rel, NULL);

    batchslot = mistate->slots[mistate->cur_slots];

    ExecClearTuple(batchslot);
    ExecCopySlot(batchslot, slot);

    /*
     * Calculate the tuple size after the original slot is copied, because the
     * copied slot type and the tuple size may change.
     */
    sz = GetTupleSize(batchslot, mistate->max_size);

    Assert(sz > 0);

    mistate->cur_slots++;
    mistate->cur_size += sz;

    if (mistate->cur_slots >= mistate->max_slots ||
        mistate->cur_size >= mistate->max_size)
        heap_multi_insert_flush(state);
}

void
heap_multi_insert_flush(TableInsertState *state)
{
    HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
    MemoryContext oldcontext;

    Assert(mistate && mistate->slots && mistate->cur_slots >= 0 &&
           mistate->context);

    if (mistate->cur_slots == 0)
        return;

    oldcontext = MemoryContextSwitchTo(mistate->context);

    heap_multi_insert(state->rel, mistate->slots, mistate->cur_slots,
                      state->cid, state->options, state->bistate);

    MemoryContextReset(mistate->context);
    MemoryContextSwitchTo(oldcontext);

    /*
     * Do not clear the slots always. Sometimes callers may want the slots for
     * index insertions or after row trigger executions in which case they have
     * to clear the tuples before using for the next insert batch.
     */
    if (state->clear_mi_slots)
    {
        int i;

        for (i = 0; i < mistate->cur_slots; i++)
            ExecClearTuple(mistate->slots[i]);
    }

    mistate->cur_slots = 0;
    mistate->cur_size = 0;
}


> 
>  > Also, why do we want to do ExecClearTuple() anyway? Isn't
>  > it good enough that the next call to ExecCopySlot will effectively clear
>  > it out?
> 
> For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the 
> slot before copying. But, for buffer heap slots, the 
> tts_buffer_heap_copyslot does not always clear the destination slot, see 
> below. If we fall into else condition, we might get some issues. And
> also note that, once the slot is cleared in ExecClearTuple, it will not 
> be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be 
> false. That is why, let's have ExecClearTuple as is.
> 
I had no idea the buffer heap slot doesn't unconditionally clear out the 
slot :( So yes lets call it unconditionally ourselves. See also 
suggestion above.

>      /*
>       * If the source slot is of a different kind, or is a buffer slot 
> that has
>       * been materialized / is virtual, make a new copy of the tuple. 
> Otherwise
>       * make a new reference to the in-buffer tuple.
>       */
>      if (dstslot->tts_ops != srcslot->tts_ops ||
>          TTS_SHOULDFREE(srcslot) ||
>          !bsrcslot->base.tuple)
>      {
>          MemoryContext oldContext;
> 
>          ExecClearTuple(dstslot);
>      }
>      else
>      {
>          Assert(BufferIsValid(bsrcslot->buffer));
> 
>          tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
>                                      bsrcslot->buffer, false);
> 
>  > - flushed -> why is this a stored boolean? isn't this indirectly encoded
>  > by cur_slots/cur_size == 0?
> 
> Note that cur_slots is in HeapMultiInsertState and outside of the new 
> APIs i.e. in TableInsertState, mistate is a void pointer, and we can't 
> really access the cur_slots. I mean, we can access but we need to be 
> dereferencing using the tableam kind. Instead of  doing all of that, to 
> keep the API cleaner, I chose to have a boolean in the TableInsertState 
> which we can see and use outside of the new APIs. Hope that's fine.
> 
So you mean the flushed variable is actually there to tell the user of 
the API that they are supposed to call flush before end? Why can't the 
end call flush itself then? I guess I completely misunderstood the 
purpose of table_multi_insert_flush being public. I had assumed it is 
there to from the usage site indicate that now would be a good time to 
flush, e.g. because of a statement ending or something. I had not 
understood this is a requirement that its always required to do 
table_multi_insert_flush + table_insert_end.
IMHO I would hide this from the callee, given that you would only really 
call flush yourself when you immediately after would call end, or are 
there other cases where one would be required to explicitly call flush?

> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com>

Kind regards,
Luc

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

06 January 2021, 13:06:38

On Wed, Jan 6, 2021 at 12:56 PM Luc Vlaming <luc@swarm64.com> wrote:
> The main reason for me for wanting a single API is that I would like the
> decision of using single or multi inserts to move to inside the tableam.
> For e.g. a heap insert we might want to put the threshold at e.g. 100
> rows so that the overhead of buffering the tuples is actually
> compensated. For other tableam this logic might also be quite different,
> and I think therefore that it shouldn't be e.g. COPY or CTAS deciding
> whether or not multi inserts should be used. Because otherwise the thing
> we'll get is that there will be tableams that will ignore this flag and
> do their own thing anyway. I'd rather have an API that gives all
> necessary information to the tableam and then make the tableam do "the
> right thing".
>
> Another reason I'm suggesting this API is that I would expect that the
> begin is called in a different place in the code for the (multiple)
> inserts than the actual insert statement.
> To me conceptually the begin and end are like e.g. the executor begin
> and end: you prepare the inserts with the knowledge you have at that
> point. I assumed (wrongly?) that during the start of the statement one
> knows best how many rows are coming; and then the actual insertion of
> the row doesn't have to deal anymore with multi/single inserts, choosing
> when to buffer or not, because that information has already been given
> during the initial phase. One of the reasons this is appealing to me is
> that e.g. in [1] there was discussion on when to switch to a multi
> insert state, and imo this should be up to the tableam.

Agree that whether to go with the multi or single inserts should be
completely left to tableam implementation, we, as callers of those API
just need to inform whether we expect single or multiple rows, and it
should be left to tableam implementation whether to actually go with
buffering or single inserts. ISTM that it's an elegant way of making
the API generic and abstracting everything from the callers. What I
wonder is how can we know in advance the expected row count that we
need to pass in to heap_insert_begin()? IIUC, we can not estimate the
upcoming rows in COPY, Insert Into Select, or Refresh Mat View or some
other insert queries?  Of course, we can look at the planner's
estimated row count for the selects in COPY, Insert Into Select or
Refresh Mat View after the planning, but to me that's not something we
can depend on and pass in the row count to the insert APIs.

When we don't know the expected row count, why can't we(as callers of
the APIs) tell the APIs something like, "I'm intending to perform
multi inserts, so if possible and if you have a mechanism to buffer
the slots, do it, otherwise insert the tuples one by one, or else do
whatever you want to do with the tuples I give it you". So, in case of
COPY we can ask the API for multi inserts and call heap_insert_begin()
and heap_insert_v2().

Given the above explanation, I still feel bool is_multi would suffice.

Thoughts?

On dynamically, switching from single to multi inserts, this can be
done by heap_insert_v2 itself. The way I think it's possible is that,
say we have some threshold row count 1000(can be a macro)  after
inserting those many tuples, heap_insert_v2 can switch to buffering
mode.

Thoughts?

> Which would make the code something like:
>
> void
> heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
> {
>         TupleTableSlot  *batchslot;
>         HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
>         Size sz;
>
>         Assert(mistate && mistate->slots);
>
>         if (mistate->slots[mistate->cur_slots] == NULL)
>                 mistate->slots[mistate->cur_slots] =
>                                                                         table_slot_create(state->rel, NULL);
>
>         batchslot = mistate->slots[mistate->cur_slots];
>
>         ExecClearTuple(batchslot);
>         ExecCopySlot(batchslot, slot);
>
>         /*
>          * Calculate the tuple size after the original slot is copied, because the
>          * copied slot type and the tuple size may change.
>          */
>         sz = GetTupleSize(batchslot, mistate->max_size);
>
>         Assert(sz > 0);
>
>         mistate->cur_slots++;
>         mistate->cur_size += sz;
>
>         if (mistate->cur_slots >= mistate->max_slots ||
>                 mistate->cur_size >= mistate->max_size)
>                 heap_multi_insert_flush(state);
> }

I think clearing tuples before copying the slot as you suggested may
work without the need of clear_slots flag.

>
> >  > Also, why do we want to do ExecClearTuple() anyway? Isn't
> >  > it good enough that the next call to ExecCopySlot will effectively clear
> >  > it out?
> >
> > For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the
> > slot before copying. But, for buffer heap slots, the
> > tts_buffer_heap_copyslot does not always clear the destination slot, see
> > below. If we fall into else condition, we might get some issues. And
> > also note that, once the slot is cleared in ExecClearTuple, it will not
> > be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be
> > false. That is why, let's have ExecClearTuple as is.
> >
> I had no idea the buffer heap slot doesn't unconditionally clear out the
> slot :( So yes lets call it unconditionally ourselves. See also
> suggestion above.

Yeah, we will clear the tuple slot before copy to be on the safer side.

> >      /*
> >       * If the source slot is of a different kind, or is a buffer slot
> > that has
> >       * been materialized / is virtual, make a new copy of the tuple.
> > Otherwise
> >       * make a new reference to the in-buffer tuple.
> >       */
> >      if (dstslot->tts_ops != srcslot->tts_ops ||
> >          TTS_SHOULDFREE(srcslot) ||
> >          !bsrcslot->base.tuple)
> >      {
> >          MemoryContext oldContext;
> >
> >          ExecClearTuple(dstslot);
> >      }
> >      else
> >      {
> >          Assert(BufferIsValid(bsrcslot->buffer));
> >
> >          tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
> >                                      bsrcslot->buffer, false);
> >
> >  > - flushed -> why is this a stored boolean? isn't this indirectly encoded
> >  > by cur_slots/cur_size == 0?
> >
> > Note that cur_slots is in HeapMultiInsertState and outside of the new
> > APIs i.e. in TableInsertState, mistate is a void pointer, and we can't
> > really access the cur_slots. I mean, we can access but we need to be
> > dereferencing using the tableam kind. Instead of  doing all of that, to
> > keep the API cleaner, I chose to have a boolean in the TableInsertState
> > which we can see and use outside of the new APIs. Hope that's fine.
> >
> So you mean the flushed variable is actually there to tell the user of
> the API that they are supposed to call flush before end? Why can't the
> end call flush itself then? I guess I completely misunderstood the
> purpose of table_multi_insert_flush being public. I had assumed it is
> there to from the usage site indicate that now would be a good time to
> flush, e.g. because of a statement ending or something. I had not
> understood this is a requirement that its always required to do
> table_multi_insert_flush + table_insert_end.
> IMHO I would hide this from the callee, given that you would only really
> call flush yourself when you immediately after would call end, or are
> there other cases where one would be required to explicitly call flush?

We need to know outside the multi_insert API whether the buffered
slots in case of multi inserts are flushed. Reason is that if we have
indexes or after row triggers, currently we call ExecInsertIndexTuples
or ExecARInsertTriggers on the buffered slots outside the API in a
loop after the flush.

If we agree on removing heap_multi_insert_v2 API and embed that logic
inside heap_insert_v2, then we can do this - pass the required
information and the functions ExecInsertIndexTuples and
ExecARInsertTriggers as callbacks so that, whether or not
heap_insert_v2 choses single or multi inserts, it can callback these
functions with the required information passed after the flush. We can
add the callback and required information into TableInsertState. But,
I'm not quite sure, we would make ExecInsertIndexTuples and
ExecARInsertTriggers. And in

If we don't want to go with callback way, then at least we need to
know whether or not heap_insert_v2 has chosen multi inserts, if yes,
the buffered slots array, and the number of current buffered slots,
whether they are flushed or not in the TableInsertState. Then,
eventually, we might need all the HeapMultiInsertState info in the
TableInsertState.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: New Table Access Methods for Multi and Single Inserts

From

Luc Vlaming

Date:

12 January 2021, 08:03:33

On 06-01-2021 14:06, Bharath Rupireddy wrote:
> On Wed, Jan 6, 2021 at 12:56 PM Luc Vlaming <luc@swarm64.com> wrote:
>> The main reason for me for wanting a single API is that I would like the
>> decision of using single or multi inserts to move to inside the tableam.
>> For e.g. a heap insert we might want to put the threshold at e.g. 100
>> rows so that the overhead of buffering the tuples is actually
>> compensated. For other tableam this logic might also be quite different,
>> and I think therefore that it shouldn't be e.g. COPY or CTAS deciding
>> whether or not multi inserts should be used. Because otherwise the thing
>> we'll get is that there will be tableams that will ignore this flag and
>> do their own thing anyway. I'd rather have an API that gives all
>> necessary information to the tableam and then make the tableam do "the
>> right thing".
>>
>> Another reason I'm suggesting this API is that I would expect that the
>> begin is called in a different place in the code for the (multiple)
>> inserts than the actual insert statement.
>> To me conceptually the begin and end are like e.g. the executor begin
>> and end: you prepare the inserts with the knowledge you have at that
>> point. I assumed (wrongly?) that during the start of the statement one
>> knows best how many rows are coming; and then the actual insertion of
>> the row doesn't have to deal anymore with multi/single inserts, choosing
>> when to buffer or not, because that information has already been given
>> during the initial phase. One of the reasons this is appealing to me is
>> that e.g. in [1] there was discussion on when to switch to a multi
>> insert state, and imo this should be up to the tableam.
> 
> Agree that whether to go with the multi or single inserts should be
> completely left to tableam implementation, we, as callers of those API
> just need to inform whether we expect single or multiple rows, and it
> should be left to tableam implementation whether to actually go with
> buffering or single inserts. ISTM that it's an elegant way of making
> the API generic and abstracting everything from the callers. What I
> wonder is how can we know in advance the expected row count that we
> need to pass in to heap_insert_begin()? IIUC, we can not estimate the
> upcoming rows in COPY, Insert Into Select, or Refresh Mat View or some
> other insert queries?  Of course, we can look at the planner's
> estimated row count for the selects in COPY, Insert Into Select or
> Refresh Mat View after the planning, but to me that's not something we
> can depend on and pass in the row count to the insert APIs.
> 
> When we don't know the expected row count, why can't we(as callers of
> the APIs) tell the APIs something like, "I'm intending to perform
> multi inserts, so if possible and if you have a mechanism to buffer
> the slots, do it, otherwise insert the tuples one by one, or else do
> whatever you want to do with the tuples I give it you". So, in case of
> COPY we can ask the API for multi inserts and call heap_insert_begin()
> and heap_insert_v2().
> 

I thought that when it is available (because of planning) it would be 
nice to pass it in. If you don't know you could pass in a 1 for doing 
single inserts, and e.g. -1 or max-int for streaming. The reason I 
proposed it is so that tableam's have as much knowledge as posisble to 
do the right thing. is_multi does also work of course but is just 
somewhat less informative.

What to me seemed somewhat counterintuitive is that with the proposed 
API it is possible to say is_multi=true and then still call 
heap_insert_v2 to do a single insert.

> Given the above explanation, I still feel bool is_multi would suffice.
> 
> Thoughts?
> 
> On dynamically, switching from single to multi inserts, this can be
> done by heap_insert_v2 itself. The way I think it's possible is that,
> say we have some threshold row count 1000(can be a macro)  after
> inserting those many tuples, heap_insert_v2 can switch to buffering
> mode.

For that I thought it'd be good to use the expected row count, but yeah 
dynamically switching also works and might work better if the expected 
row counts are usually off.

> 
> Thoughts?
> 
>> Which would make the code something like:
>>
>> void
>> heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
>> {
>>          TupleTableSlot  *batchslot;
>>          HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
>>          Size sz;
>>
>>          Assert(mistate && mistate->slots);
>>
>>          if (mistate->slots[mistate->cur_slots] == NULL)
>>                  mistate->slots[mistate->cur_slots] =
>>                                                                          table_slot_create(state->rel, NULL);
>>
>>          batchslot = mistate->slots[mistate->cur_slots];
>>
>>          ExecClearTuple(batchslot);
>>          ExecCopySlot(batchslot, slot);
>>
>>          /*
>>           * Calculate the tuple size after the original slot is copied, because the
>>           * copied slot type and the tuple size may change.
>>           */
>>          sz = GetTupleSize(batchslot, mistate->max_size);
>>
>>          Assert(sz > 0);
>>
>>          mistate->cur_slots++;
>>          mistate->cur_size += sz;
>>
>>          if (mistate->cur_slots >= mistate->max_slots ||
>>                  mistate->cur_size >= mistate->max_size)
>>                  heap_multi_insert_flush(state);
>> }
> 
> I think clearing tuples before copying the slot as you suggested may
> work without the need of clear_slots flag.

ok, cool :)

> 
>>
>>>   > Also, why do we want to do ExecClearTuple() anyway? Isn't
>>>   > it good enough that the next call to ExecCopySlot will effectively clear
>>>   > it out?
>>>
>>> For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the
>>> slot before copying. But, for buffer heap slots, the
>>> tts_buffer_heap_copyslot does not always clear the destination slot, see
>>> below. If we fall into else condition, we might get some issues. And
>>> also note that, once the slot is cleared in ExecClearTuple, it will not
>>> be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be
>>> false. That is why, let's have ExecClearTuple as is.
>>>
>> I had no idea the buffer heap slot doesn't unconditionally clear out the
>> slot :( So yes lets call it unconditionally ourselves. See also
>> suggestion above.
> 
> Yeah, we will clear the tuple slot before copy to be on the safer side.
> 

ok

>>>       /*
>>>        * If the source slot is of a different kind, or is a buffer slot
>>> that has
>>>        * been materialized / is virtual, make a new copy of the tuple.
>>> Otherwise
>>>        * make a new reference to the in-buffer tuple.
>>>        */
>>>       if (dstslot->tts_ops != srcslot->tts_ops ||
>>>           TTS_SHOULDFREE(srcslot) ||
>>>           !bsrcslot->base.tuple)
>>>       {
>>>           MemoryContext oldContext;
>>>
>>>           ExecClearTuple(dstslot);
>>>       }
>>>       else
>>>       {
>>>           Assert(BufferIsValid(bsrcslot->buffer));
>>>
>>>           tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
>>>                                       bsrcslot->buffer, false);
>>>
>>>   > - flushed -> why is this a stored boolean? isn't this indirectly encoded
>>>   > by cur_slots/cur_size == 0?
>>>
>>> Note that cur_slots is in HeapMultiInsertState and outside of the new
>>> APIs i.e. in TableInsertState, mistate is a void pointer, and we can't
>>> really access the cur_slots. I mean, we can access but we need to be
>>> dereferencing using the tableam kind. Instead of  doing all of that, to
>>> keep the API cleaner, I chose to have a boolean in the TableInsertState
>>> which we can see and use outside of the new APIs. Hope that's fine.
>>>
>> So you mean the flushed variable is actually there to tell the user of
>> the API that they are supposed to call flush before end? Why can't the
>> end call flush itself then? I guess I completely misunderstood the
>> purpose of table_multi_insert_flush being public. I had assumed it is
>> there to from the usage site indicate that now would be a good time to
>> flush, e.g. because of a statement ending or something. I had not
>> understood this is a requirement that its always required to do
>> table_multi_insert_flush + table_insert_end.
>> IMHO I would hide this from the callee, given that you would only really
>> call flush yourself when you immediately after would call end, or are
>> there other cases where one would be required to explicitly call flush?
> 
> We need to know outside the multi_insert API whether the buffered
> slots in case of multi inserts are flushed. Reason is that if we have
> indexes or after row triggers, currently we call ExecInsertIndexTuples
> or ExecARInsertTriggers on the buffered slots outside the API in a
> loop after the flush.
> 
> If we agree on removing heap_multi_insert_v2 API and embed that logic
> inside heap_insert_v2, then we can do this - pass the required
> information and the functions ExecInsertIndexTuples and
> ExecARInsertTriggers as callbacks so that, whether or not
> heap_insert_v2 choses single or multi inserts, it can callback these
> functions with the required information passed after the flush. We can
> add the callback and required information into TableInsertState. But,
> I'm not quite sure, we would make ExecInsertIndexTuples and
> ExecARInsertTriggers. And in
> 
> If we don't want to go with callback way, then at least we need to
> know whether or not heap_insert_v2 has chosen multi inserts, if yes,
> the buffered slots array, and the number of current buffered slots,
> whether they are flushed or not in the TableInsertState. Then,
> eventually, we might need all the HeapMultiInsertState info in the
> TableInsertState.
> 

To me the callback API seems cleaner, that on heap_insert_begin we can 
pass in a callback that is called on every flushed slot, or only on 
multi-insert flushes. Is there a reason it would only be done for 
multi-insert flushes or can it be generic?

> Thoughts?
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
> 

Hi,

Replied inline.

Kind regards,
Luc

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

16 January 2021, 23:04:16

> If we agree on removing heap_multi_insert_v2 API and embed that logic
> inside heap_insert_v2, then we can do this - pass the required
> information and the functions ExecInsertIndexTuples and
> ExecARInsertTriggers as callbacks so that, whether or not
> heap_insert_v2 choses single or multi inserts, it can callback these
> functions with the required information passed after the flush. We
> can
> add the callback and required information into TableInsertState. But,
> I'm not quite sure, we would make ExecInsertIndexTuples and
> ExecARInsertTriggers.

How should the API interact with INSERT INTO ... SELECT? Right now it
doesn't appear to be integrated at all, but that seems like a fairly
important path for bulk inserts.

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Luc Vlaming

Date:

18 January 2021, 07:58:08

On 17-01-2021 00:04, Jeff Davis wrote:
> 
>> If we agree on removing heap_multi_insert_v2 API and embed that logic
>> inside heap_insert_v2, then we can do this - pass the required
>> information and the functions ExecInsertIndexTuples and
>> ExecARInsertTriggers as callbacks so that, whether or not
>> heap_insert_v2 choses single or multi inserts, it can callback these
>> functions with the required information passed after the flush. We
>> can
>> add the callback and required information into TableInsertState. But,
>> I'm not quite sure, we would make ExecInsertIndexTuples and
>> ExecARInsertTriggers.
> 
> How should the API interact with INSERT INTO ... SELECT? Right now it
> doesn't appear to be integrated at all, but that seems like a fairly
> important path for bulk inserts.
> 
> Regards,
>     Jeff Davis
> 
> 

Hi,

You mean how it could because of that the table modification API uses 
the table_tuple_insert_speculative ? Just wondering if you think if it 
generally cannot work or would like to see that path / more paths 
integrated in to the patch.

Kind regards,
Luc

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

19 January 2021, 17:33:09

On Mon, 2021-01-18 at 08:58 +0100, Luc Vlaming wrote:
> You mean how it could because of that the table modification API
> uses 
> the table_tuple_insert_speculative ? Just wondering if you think if
> it 
> generally cannot work or would like to see that path / more paths 
> integrated in to the patch.

I think the patch should support INSERT INTO ... SELECT, and it will be
easier to tell if we have the right API when that's integrated.

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

17 February 2021, 07:16:25

Hi,

I addressed the following review comments and attaching v3 patch set.

1) ExecClearTuple happens before ExecCopySlot in heap_multi_insert_v2
and this allowed us to remove clear_mi_slots flag from
TableInsertState.
2) I retained the flushed variable inside TableInsertState so that the
callers can know whether the buffered slots have been flushed. If yes,
the callers can execute after insert row triggers or perform index
insertions. This is easier than passing the after insert row triggers
info and index info to new multi insert table am and let it do. This
way the functionalities can be kept separate i.e. multi insert ams do
only buffering, decisions on when to flush, insertions and the callers
will execute triggers or index insertions. And also none of the
existing table ams are performing these operations within them, so
this is inline with the current design of the table ams.
3) I have kept the single and multi insert API separate. The previous
suggestion was to have only a single insert API and let the callers
provide initially whether they want multi or single inserts. One
problem with that approach is that we have to allow table ams to
execute the after row triggers or index insertions. That is something
I personally don't like.

0001 - new table ams implementation
0002 - the new multi table ams used in CREATE TABLE AS and REFRESH
MATERIALIZED VIEW
0003 - the new multi table ams used in COPY

Please review the v3 patch set further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

05 April 2021, 04:19:03

On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Attaching the v4 patch set. Please review it further.

Attaching v5 patch set after rebasing onto the latest master. Please
review it further.


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

19 April 2021, 04:51:36

On Mon, Apr 5, 2021 at 9:49 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > Attaching the v4 patch set. Please review it further.
>
> Attaching v5 patch set after rebasing onto the latest master.

Another rebase due to conflicts in 0003. Attaching v6 for review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Matthias van de Meent

Date:

04 March 2022, 14:37:32

On Mon, 19 Apr 2021 at 06:52, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Apr 5, 2021 at 9:49 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > > Attaching the v4 patch set. Please review it further.
> >
> > Attaching v5 patch set after rebasing onto the latest master.
>
> Another rebase due to conflicts in 0003. Attaching v6 for review.

I recently touched the topic of multi_insert, and I remembered this
patch. I had to dig a bit to find it, but as it's still open I've
added some comments:

> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> +#define MAX_BUFFERED_TUPLES        1000
> +#define MAX_BUFFERED_BYTES        65535

It looks like these values were copied over from copyfrom.c, but are
now expressed in the context of the heapam.
As these values are now heap-specific (as opposed to the
TableAM-independent COPY infrastructure), shouldn't we instead
optimize for heap page insertions? That is, I suggest using a multiple
(1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a
similar multiple of BLCKSZ for MAX_BUFFERED_BYTES.

> TableInsertState->flushed
> TableInsertState->mi_slots

I don't quite like the current storage-and-feedback mechanism for
flushed tuples. The current assumptions in this mechanism seem to be
that
1.) access methods always want to flush all available tuples at once,
2.) access methods want to maintain the TupleTableSlots for all
inserted tuples that have not yet had all triggers handled, and
3.) we need access to the not-yet-flushed TupleTableSlots.

I think that that is not a correct set of assumptions; I think that
only flushed tuples need to be visible to the tableam-agnostic code;
and that tableams should be allowed to choose which tuples to flush at
which point; as long as they're all flushed after a final
multi_insert_flush.

Examples:
A heap-based access method might want bin-pack tuples and write out
full pages only; and thus keep some tuples in the buffers as they
didn't fill a page; thus having flushed only a subset of the current
buffered tuples.
A columnstore-based access method might not want to maintain the
TupleTableSlots of original tuples, but instead use temporary columnar
storage: TupleTableSlots are quite large when working with huge
amounts of tuples; and keeping lots of tuple data in memory is
expensive.

As such, I think that this should be replaced with a
TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len,
managed by the tableAM, in which only the flushed tuples are visible
to the AM-agnostic code. This is not much different from how the
implementation currently works; except that ->mi_slots now does not
expose unflushed tuples; and that ->flushed is replaced by an integer
value of number of flushed tuples.

A further improvement (in my opinion) would be the change from a
single multi_insert_flush() to a signalling-based multi_insert_flush:
It is not unreasonable for e.g. a columnstore to buffer tens of
thousands of inserts; but doing so in TupleTableSlots would introduce
a high memory usage. Allowing for batched retrieval of flushed tuples
would help in memory usage; which is why multiple calls to
multi_insert_flush() could be useful. To handle this gracefully, we'd
probably add TIS->mi_flush_remaining, where each insert adds one to
mi_flush_remaining; and each time mi_flushed_slots has been handled
mi_flush_remaining is decreased by mi_flushed_len by the handler code.
Once we're done inserting into the table, we keep calling
multi_insert_flush until no more tuples are being flushed (and error
out if we're still waiting for flushes but no new flushed tuples are
returned).

- Matthias

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

06 March 2022, 11:11:54

On Fri, Mar 4, 2022 at 8:07 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > Another rebase due to conflicts in 0003. Attaching v6 for review.
>
> I recently touched the topic of multi_insert, and I remembered this
> patch. I had to dig a bit to find it, but as it's still open I've
> added some comments:

Thanks for reviving the thread. I almost lost hope in it. In fact, it
took me a while to recollect the work and respond to your comments.
I'm now happy to answer or continue working on this patch if you or
someone is really interested to review it and take it to the end.

> > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > +#define MAX_BUFFERED_TUPLES        1000
> > +#define MAX_BUFFERED_BYTES        65535
>
> It looks like these values were copied over from copyfrom.c, but are
> now expressed in the context of the heapam.
> As these values are now heap-specific (as opposed to the
> TableAM-independent COPY infrastructure), shouldn't we instead
> optimize for heap page insertions? That is, I suggest using a multiple
> (1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a
> similar multiple of BLCKSZ for MAX_BUFFERED_BYTES.

We can do that. In fact, it is a good idea to let callers pass in as
an input to tuple_insert_begin and have it as part of
TableInsertState. If okay, I will do that in the next patch.

> > TableInsertState->flushed
> > TableInsertState->mi_slots
>
> I don't quite like the current storage-and-feedback mechanism for
> flushed tuples. The current assumptions in this mechanism seem to be
> that
> 1.) access methods always want to flush all available tuples at once,
> 2.) access methods want to maintain the TupleTableSlots for all
> inserted tuples that have not yet had all triggers handled, and
> 3.) we need access to the not-yet-flushed TupleTableSlots.
>
> I think that that is not a correct set of assumptions; I think that
> only flushed tuples need to be visible to the tableam-agnostic code;
> and that tableams should be allowed to choose which tuples to flush at
> which point; as long as they're all flushed after a final
> multi_insert_flush.
>
> Examples:
> A heap-based access method might want bin-pack tuples and write out
> full pages only; and thus keep some tuples in the buffers as they
> didn't fill a page; thus having flushed only a subset of the current
> buffered tuples.
> A columnstore-based access method might not want to maintain the
> TupleTableSlots of original tuples, but instead use temporary columnar
> storage: TupleTableSlots are quite large when working with huge
> amounts of tuples; and keeping lots of tuple data in memory is
> expensive.
>
> As such, I think that this should be replaced with a
> TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len,
> managed by the tableAM, in which only the flushed tuples are visible
> to the AM-agnostic code. This is not much different from how the
> implementation currently works; except that ->mi_slots now does not
> expose unflushed tuples; and that ->flushed is replaced by an integer
> value of number of flushed tuples.

Yeah, that makes sense. Let's table AMs expose the flushed tuples
outside on which the callers can handle the after-insert row triggers
upon them.

IIUC, TableInsertState needs to have few other variables:

    /* Below members are only used for multi inserts. */
    /* Array of buffered slots. */
    TupleTableSlot  **mi_slots;
    /* Number of slots that are currently buffered. */
    int32   mi_cur_slots;
    /* Array of flushed slots that will be used by callers to handle
after-insert row triggers or similar events outside */
    TupleTableSlot  **mi_flushed_slots ;
    /* Number of slots that are currently buffered. */
    int32   mi_no_of_flushed_slots;

The implementation of heap_multi_insert_flush will just set the
mi_slots to mi_flushed_slots.

> A further improvement (in my opinion) would be the change from a
> single multi_insert_flush() to a signalling-based multi_insert_flush:
> It is not unreasonable for e.g. a columnstore to buffer tens of
> thousands of inserts; but doing so in TupleTableSlots would introduce
> a high memory usage. Allowing for batched retrieval of flushed tuples
> would help in memory usage; which is why multiple calls to
> multi_insert_flush() could be useful. To handle this gracefully, we'd
> probably add TIS->mi_flush_remaining, where each insert adds one to
> mi_flush_remaining; and each time mi_flushed_slots has been handled
> mi_flush_remaining is decreased by mi_flushed_len by the handler code.
> Once we're done inserting into the table, we keep calling
> multi_insert_flush until no more tuples are being flushed (and error
> out if we're still waiting for flushes but no new flushed tuples are
> returned).

The current approach is signalling-based right?
heap_multi_insert_v2
    if (state->mi_cur_slots >= mistate->max_slots ||
        mistate->cur_size >= mistate->max_size)
        heap_multi_insert_flush(state);

The table_multi_insert_v2 am implementers will have to carefully
choose buffering strategy i.e. number of tuples, size to buffer and
decide rightly without hitting memory usages.

Thoughts?

Regards,
Bharath Rupireddy.

Re: New Table Access Methods for Multi and Single Inserts

From

Matthias van de Meent

Date:

07 March 2022, 16:09:23

On Sun, 6 Mar 2022 at 12:12, Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Mar 4, 2022 at 8:07 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > > Another rebase due to conflicts in 0003. Attaching v6 for review.
> >
> > I recently touched the topic of multi_insert, and I remembered this
> > patch. I had to dig a bit to find it, but as it's still open I've
> > added some comments:
>
> Thanks for reviving the thread. I almost lost hope in it. In fact, it
> took me a while to recollect the work and respond to your comments.
> I'm now happy to answer or continue working on this patch if you or
> someone is really interested to review it and take it to the end.
>
> > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > > +#define MAX_BUFFERED_TUPLES        1000
> > > +#define MAX_BUFFERED_BYTES        65535
> >
> > It looks like these values were copied over from copyfrom.c, but are
> > now expressed in the context of the heapam.
> > As these values are now heap-specific (as opposed to the
> > TableAM-independent COPY infrastructure), shouldn't we instead
> > optimize for heap page insertions? That is, I suggest using a multiple
> > (1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a
> > similar multiple of BLCKSZ for MAX_BUFFERED_BYTES.
>
> We can do that. In fact, it is a good idea to let callers pass in as
> an input to tuple_insert_begin and have it as part of
> TableInsertState. If okay, I will do that in the next patch.
>
> > > TableInsertState->flushed
> > > TableInsertState->mi_slots
> >
> > I don't quite like the current storage-and-feedback mechanism for
> > flushed tuples. The current assumptions in this mechanism seem to be
> > that
> > 1.) access methods always want to flush all available tuples at once,
> > 2.) access methods want to maintain the TupleTableSlots for all
> > inserted tuples that have not yet had all triggers handled, and
> > 3.) we need access to the not-yet-flushed TupleTableSlots.
> >
> > I think that that is not a correct set of assumptions; I think that
> > only flushed tuples need to be visible to the tableam-agnostic code;
> > and that tableams should be allowed to choose which tuples to flush at
> > which point; as long as they're all flushed after a final
> > multi_insert_flush.
> >
> > Examples:
> > A heap-based access method might want bin-pack tuples and write out
> > full pages only; and thus keep some tuples in the buffers as they
> > didn't fill a page; thus having flushed only a subset of the current
> > buffered tuples.
> > A columnstore-based access method might not want to maintain the
> > TupleTableSlots of original tuples, but instead use temporary columnar
> > storage: TupleTableSlots are quite large when working with huge
> > amounts of tuples; and keeping lots of tuple data in memory is
> > expensive.
> >
> > As such, I think that this should be replaced with a
> > TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len,
> > managed by the tableAM, in which only the flushed tuples are visible
> > to the AM-agnostic code. This is not much different from how the
> > implementation currently works; except that ->mi_slots now does not
> > expose unflushed tuples; and that ->flushed is replaced by an integer
> > value of number of flushed tuples.
>
> Yeah, that makes sense. Let's table AMs expose the flushed tuples
> outside on which the callers can handle the after-insert row triggers
> upon them.
>
> IIUC, TableInsertState needs to have few other variables:
>
>     /* Below members are only used for multi inserts. */
>     /* Array of buffered slots. */
>     TupleTableSlot  **mi_slots;

Not quite: there's no need for TupleTableSlot  **mi_slots in the
TableInsertState; as the buffer used by the tableAM to buffer
unflushed tuples shouldn't be publicly visible. I suspect that moving
that field to HeapMultiInsertState instead would be the prudent thing
to do; limiting the external access of AM-specific buffers.

>     /* Number of slots that are currently buffered. */
>     int32   mi_cur_slots;
>     /* Array of flushed slots that will be used by callers to handle
> after-insert row triggers or similar events outside */
>     TupleTableSlot  **mi_flushed_slots ;
>     /* Number of slots that are currently buffered. */
>     int32   mi_no_of_flushed_slots;
>
> The implementation of heap_multi_insert_flush will just set the
> mi_slots to mi_flushed_slots.

Yes.

> > A further improvement (in my opinion) would be the change from a
> > single multi_insert_flush() to a signalling-based multi_insert_flush:
> > It is not unreasonable for e.g. a columnstore to buffer tens of
> > thousands of inserts; but doing so in TupleTableSlots would introduce
> > a high memory usage. Allowing for batched retrieval of flushed tuples
> > would help in memory usage; which is why multiple calls to
> > multi_insert_flush() could be useful. To handle this gracefully, we'd
> > probably add TIS->mi_flush_remaining, where each insert adds one to
> > mi_flush_remaining; and each time mi_flushed_slots has been handled
> > mi_flush_remaining is decreased by mi_flushed_len by the handler code.
> > Once we're done inserting into the table, we keep calling
> > multi_insert_flush until no more tuples are being flushed (and error
> > out if we're still waiting for flushes but no new flushed tuples are
> > returned).
>
> The current approach is signalling-based right?
> heap_multi_insert_v2
>     if (state->mi_cur_slots >= mistate->max_slots ||
>         mistate->cur_size >= mistate->max_size)
>         heap_multi_insert_flush(state);

That's for the AM-internal flushing; yes. I was thinking about the AM
api for flushing that's used when finalizing the batched insert; i.e.
table_multi_insert_flush.

Currently it assumes that all buffered tuples will be flushed after
one call (which is correct for heap), but putting those unflushed
tuples all at once back in memory might not be desirable or possible
(for e.g. columnar); so we might need to call table_multi_insert_flush
until there's no more buffered tuples.

> The table_multi_insert_v2 am implementers will have to carefully
> choose buffering strategy i.e. number of tuples, size to buffer and
> decide rightly without hitting memory usages.

Agreed

-Matthias

Re: New Table Access Methods for Multi and Single Inserts

From

Michael Paquier

Date:

12 October 2022, 05:30:59

On Mon, Mar 07, 2022 at 05:09:23PM +0100, Matthias van de Meent wrote:
> That's for the AM-internal flushing; yes. I was thinking about the AM
> api for flushing that's used when finalizing the batched insert; i.e.
> table_multi_insert_flush.
>
> Currently it assumes that all buffered tuples will be flushed after
> one call (which is correct for heap), but putting those unflushed
> tuples all at once back in memory might not be desirable or possible
> (for e.g. columnar); so we might need to call table_multi_insert_flush
> until there's no more buffered tuples.

This thread has been idle for 6 months now, so I have marked it as
returned with feedback as of what looks like a lack of activity.  I
have looked at what's been proposed, and I am not really sure if the
direction taken is correct, though there may be a potential gain in
consolidating the multi-insert path within the table AM set of
callbacks.
--
Michael

Attachment

signature.asc

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

12 October 2022, 05:35:29

On Wed, Oct 12, 2022 at 11:01 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> This thread has been idle for 6 months now, so I have marked it as
> returned with feedback as of what looks like a lack of activity.  I
> have looked at what's been proposed, and I am not really sure if the
> direction taken is correct, though there may be a potential gain in
> consolidating the multi-insert path within the table AM set of
> callbacks.

Thanks. Unfortunately, I'm not finding enough cycles to work on this
feature. I'm happy to help if others have any further thoughts and
take it from here.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: New Table Access Methods for Multi and Single Inserts

From

Andres Freund

Date:

03 June 2023, 22:38:24

Hi,

This patch was referenced in a discussion at pgcon, so I thought I'd give it a
look, even though Bharat said that he won't have time to drive it forward...


On 2021-04-19 10:21:36 +0530, Bharath Rupireddy wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index bd5faf0c1f..655de8e6b7 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2558,6 +2558,11 @@ static const TableAmRoutine heapam_methods = {
>      .tuple_insert_speculative = heapam_tuple_insert_speculative,
>      .tuple_complete_speculative = heapam_tuple_complete_speculative,
>      .multi_insert = heap_multi_insert,
> +    .tuple_insert_begin = heap_insert_begin,
> +    .tuple_insert_v2 = heap_insert_v2,
> +    .multi_insert_v2 = heap_multi_insert_v2,
> +    .multi_insert_flush = heap_multi_insert_flush,
> +    .tuple_insert_end = heap_insert_end,
>      .tuple_delete = heapam_tuple_delete,
>      .tuple_update = heapam_tuple_update,
>      .tuple_lock = heapam_tuple_lock,

I don't think we should have multiple callback for the insertion APIs in
tableam.h. I think it'd be good to continue supporting the old table_*()
functions, but supporting multiple insert APIs in each AM doesn't make much
sense to me.


> +/*
> + * GetTupleSize - Compute the tuple size given a table slot.
> + *
> + * For heap tuple, buffer tuple and minimal tuple slot types return the actual
> + * tuple size that exists. For virtual tuple, the size is calculated as the
> + * slot does not have the tuple size. If the computed size exceeds the given
> + * maxsize for the virtual tuple, this function exits, not investing time in
> + * further unnecessary calculation.
> + *
> + * Important Notes:
> + * 1) Size calculation code for virtual slots is being used from
> + *       tts_virtual_materialize(), hence ensure to have the same changes or fixes
> + *       here and also there.
> + * 2) Currently, GetTupleSize() handles the existing heap, buffer, minimal and
> + *       virtual slots. Ensure to add related code in case any new slot type is
> + *    introduced.
> + */
> +inline Size
> +GetTupleSize(TupleTableSlot *slot, Size maxsize)
> +{
> +    Size sz = 0;
> +    HeapTuple tuple = NULL;
> +
> +    if (TTS_IS_HEAPTUPLE(slot))
> +        tuple = ((HeapTupleTableSlot *) slot)->tuple;
> +    else if(TTS_IS_BUFFERTUPLE(slot))
> +        tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
> +    else if(TTS_IS_MINIMALTUPLE(slot))
> +        tuple = ((MinimalTupleTableSlot *) slot)->tuple;
> +    else if(TTS_IS_VIRTUAL(slot))

I think this embeds too much knowledge of the set of slot types in core
code. I don't see why it's needed either?


> diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> index 414b6b4d57..2a1470a7b6 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -229,6 +229,32 @@ typedef struct TM_IndexDeleteOp
>      TM_IndexStatus *status;
>  } TM_IndexDeleteOp;
>
> +/* Holds table insert state. */
> +typedef struct TableInsertState

I suspect we should design it to be usable for updates and deletes in the
future, and thus name it TableModifyState.



> +{
> +    Relation    rel;
> +    /* Bulk insert state if requested, otherwise NULL. */
> +    struct BulkInsertStateData    *bistate;
> +    CommandId    cid;

Hm - I'm not sure it's a good idea to force the cid to be the same for all
inserts done via one TableInsertState.



> +    int    options;
> +    /* Below members are only used for multi inserts. */
> +    /* Array of buffered slots. */
> +    TupleTableSlot    **mi_slots;
> +    /* Number of slots that are currently buffered. */
> +    int32    mi_cur_slots;

> +    /*
> +     * Access method specific information such as parameters that are needed
> +     * for buffering and flushing decisions can go here.
> +     */
> +    void    *mistate;

I think we should instead have a generic TableModifyState, which each AM then
embeds into an AM specific AM state. Forcing two very related structs to be
allocated separately doesn't seem wise in this case.



> @@ -1430,6 +1473,50 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
>                                    cid, options, bistate);
>  }
>
> +static inline TableInsertState*
> +table_insert_begin(Relation rel, CommandId cid, int options,
> +                   bool alloc_bistate, bool is_multi)

Why have alloc_bistate and options?


> +static inline void
> +table_insert_end(TableInsertState *state)
> +{
> +    /* Deallocate bulk insert state here, since it's AM independent. */
> +    if (state->bistate)
> +        FreeBulkInsertState(state->bistate);
> +
> +    state->rel->rd_tableam->tuple_insert_end(state);
> +}

Seems like the order in here should be swapped?


Greetings,

Andres Freund

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

05 June 2023, 02:30:00

On Sun, Jun 4, 2023 at 4:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> This patch was referenced in a discussion at pgcon, so I thought I'd give it a
> look, even though Bharat said that he won't have time to drive it forward...

Thanks. I'm glad to know that the feature was discussed at PGCon.

If there's an interest, I'm happy to spend time again on it.

I'll look into the review comments and respond soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

01 August 2023, 16:30:00

On Sun, Jun 4, 2023 at 4:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> This patch was referenced in a discussion at pgcon, so I thought I'd give it a
> look, even though Bharat said that he won't have time to drive it forward...

Thanks. Finally, I  started to spend time on this. Just curious - may
I know the discussion in/for which this patch is referenced? What was
the motive? Is it captured somewhere?

> On 2021-04-19 10:21:36 +0530, Bharath Rupireddy wrote:
> > +     .tuple_insert_begin = heap_insert_begin,
> > +     .tuple_insert_v2 = heap_insert_v2,
> > +     .multi_insert_v2 = heap_multi_insert_v2,
> > +     .multi_insert_flush = heap_multi_insert_flush,
> > +     .tuple_insert_end = heap_insert_end,
>
> I don't think we should have multiple callback for the insertion APIs in
> tableam.h. I think it'd be good to continue supporting the old table_*()
> functions, but supporting multiple insert APIs in each AM doesn't make much
> sense to me.

I named these new functions XXX_v2 for compatibility reasons. Because,
it's quite possible for external modules to use existing
table_tuple_insert, table_multi_insert functions. If we were to change
the existing insert tableams, all the external modules using them
would have to change their code, is that okay?

> > +/*
> > + * GetTupleSize - Compute the tuple size given a table slot.
> > +inline Size
>
> I think this embeds too much knowledge of the set of slot types in core
> code. I don't see why it's needed either?

The heapam multi-insert implementation needs to know the tuple size
from the slot to decide whether or not to flush the tuples from the
buffers. I couldn't find a direct way then to know the tuple size from
the slot, so added that helper function. With a better understanding
now, I think we can rely on the memory allocated for TupleTableSlot's
tts_mcxt. While this works for the materialized slots passed in to the
insert functions, for non-materialized slots the flushing decision can
be solely on the number of tuples stored in the buffers. Another way
is to add a get_tuple_size callback to TupleTableSlotOps and let the
tuple slot providers give us the tuple size.

> > diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
> > index 414b6b4d57..2a1470a7b6 100644
> > --- a/src/include/access/tableam.h
> > +++ b/src/include/access/tableam.h
> > @@ -229,6 +229,32 @@ typedef struct TM_IndexDeleteOp
> >       TM_IndexStatus *status;
> >  } TM_IndexDeleteOp;
> >
> > +/* Holds table insert state. */
> > +typedef struct TableInsertState
>
> I suspect we should design it to be usable for updates and deletes in the
> future, and thus name it TableModifyState.

There are different parameters that insert/update/delete would want to
pass across in the state. So, having Table{Insert/Update/Delete}State
may be a better idea than having the unneeded variables lying around
or having a union and state_type as INSERT/UPDATE/DELETE, no? Do you
have a different thought here?

> I think we should instead have a generic TableModifyState, which each AM then
> embeds into an AM specific AM state. Forcing two very related structs to be
> allocated separately doesn't seem wise in this case.

The v7 patches have largely changed the way these options and
parameters are passed, please have a look.

> > +{
> > +     Relation        rel;
> > +     /* Bulk insert state if requested, otherwise NULL. */
> > +     struct BulkInsertStateData      *bistate;
> > +     CommandId       cid;
>
> Hm - I'm not sure it's a good idea to force the cid to be the same for all
> inserts done via one TableInsertState.

If required, someone can always pass a new CID before every
tuple_insert_v2/tuple_multi_insert_v2 call via TableInsertState. Isn't
it sufficient?

> > @@ -1430,6 +1473,50 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
> >                                                                 cid, options, bistate);
> >  }
> >
> > +static inline TableInsertState*
> > +table_insert_begin(Relation rel, CommandId cid, int options,
> > +                                bool alloc_bistate, bool is_multi)
>
> Why have alloc_bistate and options?

"alloc_bistate" is for the caller to specify if they need a bulk
insert state or not. "options" is for the caller to specify if they
need table_tuple_insert performance options such as
TABLE_INSERT_SKIP_FSM, TABLE_INSERT_FROZEN, TABLE_INSERT_NO_LOGICAL.
The v7 patches have changed the way these options and parameters are
passed, please have a look.

> > +static inline void
> > +table_insert_end(TableInsertState *state)
> > +{
> > +     /* Deallocate bulk insert state here, since it's AM independent. */
> > +     if (state->bistate)
> > +             FreeBulkInsertState(state->bistate);
> > +
> > +     state->rel->rd_tableam->tuple_insert_end(state);
> > +}
>
> Seems like the order in here should be swapped?

Right. It looks like BulkInsertState is for heapam, it really doesn't
have to be in table_XXX functions, hence it all the way down to
heap_insert_XXX functions.

I'm attaching the v7 patch set with the above review comments
addressed. My initial idea behind these new insert APIs was the
ability to re-use the multi insert code in COPY for CTAS and REFRESH
MATERIALIZED VIEW. I'm open to more thoughts here.

The v7 patches have largely changed the way state structure (heapam
specific things are moved all the way down to heapam.c) is defined,
the parameters are passed, and simplified the multi insert logic a
lot.

0001 - introduces new single and multi insert table AM and heapam
implementation of the new AM.
0002 - optimizes CREATE TABLE AS to use the new multi inserts table AM
making it faster by 2.13X or 53%.
0003 - optimizes REFRESH MATERIALIZED VIEW to use the new multi
inserts table AM making it faster by 1.52X or 34%.
0004 - uses the new multi inserts table AM for COPY FROM - I'm yet to
spend time on this, I'll share the patch when ready.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Jacob Champion

Date:

01 August 2023, 17:02:12

On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Thanks. Finally, I  started to spend time on this. Just curious - may
> I know the discussion in/for which this patch is referenced? What was
> the motive? Is it captured somewhere?

It may not have been the only place, but we at least touched on it
during the unconference:

    https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs

We discussed two related-but-separate ideas:
1) bulk/batch operations and
2) maintenance of TAM state across multiple related operations.

--Jacob

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

17 January 2024, 17:27:48

On Tue, Aug 1, 2023 at 10:32 PM Jacob Champion <jchampion@timescale.com> wrote:
>
> On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > Thanks. Finally, I  started to spend time on this. Just curious - may
> > I know the discussion in/for which this patch is referenced? What was
> > the motive? Is it captured somewhere?
>
> It may not have been the only place, but we at least touched on it
> during the unconference:
>
>     https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs
>
> We discussed two related-but-separate ideas:
> 1) bulk/batch operations and
> 2) maintenance of TAM state across multiple related operations.

Thank you. I'm attaching v8 patch-set here which includes use of new
insert TAMs for COPY FROM. With this, postgres not only will have the
new TAM for inserts, but they also can make the following commands
faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW,
REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in
the coming days and post the results here, until then I appreciate any
feedback on the patches.

I've also added this proposal to CF -
https://commitfest.postgresql.org/47/4777/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

29 January 2024, 07:27:00

On Wed, Jan 17, 2024 at 10:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Thank you. I'm attaching v8 patch-set here which includes use of new
> insert TAMs for COPY FROM. With this, postgres not only will have the
> new TAM for inserts, but they also can make the following commands
> faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW,
> REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in
> the coming days and post the results here, until then I appreciate any
> feedback on the patches.
>
> I've also added this proposal to CF -
> https://commitfest.postgresql.org/47/4777/.

Some of the tests related to Incremental Sort added by a recent commit
0452b461bc4 in aggregates.sql are failing when the multi inserts
feature is used for CTAS (like done in 0002 patch). I'm not so sure if
it's because of the reduction in the CTAS execution times. Execution
time for table 'btg' created with CREATE TABLE AS added by commit
0452b461bc4 with single inserts is 25.3 msec, with multi inserts is
17.7 msec. This means that the multi inserts are about 1.43 times or
30.04% faster  than the single inserts. Couple of ways to make these
tests pick Incremental Sort as expected - 1) CLUSTER btg USING abc; or
2) increase the number of rows in table btg to 100K from 10K. FWIW, if
I reduce the number of rows in the table from 10K to 1K, the
Incremental Sort won't get picked on HEAD with CTAS using single
inserts. Hence, I chose option (2) to fix the issue.

Please find the attached v9 patch set.

[1]
 -- Engage incremental sort
 explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
-                   QUERY PLAN
--------------------------------------------------
+          QUERY PLAN
+------------------------------
  Group
    Group Key: x, y, z, w
-   ->  Incremental Sort
+   ->  Sort
          Sort Key: x, y, z, w
-         Presorted Key: x, y
-         ->  Index Scan using btg_x_y_idx on btg
-(6 rows)
+         ->  Seq Scan on btg
+(5 rows)

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

29 January 2024, 11:46:47

On Mon, Jan 29, 2024 at 12:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 10:57 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Thank you. I'm attaching v8 patch-set here which includes use of new
> > insert TAMs for COPY FROM. With this, postgres not only will have the
> > new TAM for inserts, but they also can make the following commands
> > faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW,
> > REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in
> > the coming days and post the results here, until then I appreciate any
> > feedback on the patches.
> >
> > I've also added this proposal to CF -
> > https://commitfest.postgresql.org/47/4777/.
>
> Some of the tests related to Incremental Sort added by a recent commit
> 0452b461bc4 in aggregates.sql are failing when the multi inserts
> feature is used for CTAS (like done in 0002 patch). I'm not so sure if
> it's because of the reduction in the CTAS execution times. Execution
> time for table 'btg' created with CREATE TABLE AS added by commit
> 0452b461bc4 with single inserts is 25.3 msec, with multi inserts is
> 17.7 msec. This means that the multi inserts are about 1.43 times or
> 30.04% faster  than the single inserts. Couple of ways to make these
> tests pick Incremental Sort as expected - 1) CLUSTER btg USING abc; or
> 2) increase the number of rows in table btg to 100K from 10K. FWIW, if
> I reduce the number of rows in the table from 10K to 1K, the
> Incremental Sort won't get picked on HEAD with CTAS using single
> inserts. Hence, I chose option (2) to fix the issue.
>
> Please find the attached v9 patch set.
>
> [1]
>  -- Engage incremental sort
>  explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
> -                   QUERY PLAN
> --------------------------------------------------
> +          QUERY PLAN
> +------------------------------
>   Group
>     Group Key: x, y, z, w
> -   ->  Incremental Sort
> +   ->  Sort
>           Sort Key: x, y, z, w
> -         Presorted Key: x, y
> -         ->  Index Scan using btg_x_y_idx on btg
> -(6 rows)
> +         ->  Seq Scan on btg
> +(5 rows)

CF bot machine with Windows isn't happy with the compilation [1], so
fixed those warnings and attached v10 patch set.

[1]
[07:35:25.458] [632/2212] Compiling C object
src/backend/postgres_lib.a.p/commands_copyfrom.c.obj
[07:35:25.458] c:\cirrus\src\include\access\tableam.h(1574) : warning
C4715: 'table_multi_insert_slots': not all control paths return a
value
[07:35:25.458] c:\cirrus\src\include\access\tableam.h(1522) : warning
C4715: 'table_insert_begin': not all control paths return a value
[07:35:25.680] c:\cirrus\src\include\access\tableam.h(1561) : warning
C4715: 'table_multi_insert_next_free_slot': not all control paths
return a value
[07:35:25.680] [633/2212] Compiling C object
src/backend/postgres_lib.a.p/commands_createas.c.obj
[07:35:25.680] c:\cirrus\src\include\access\tableam.h(1522) : warning
C4715: 'table_insert_begin': not all control paths return a value
[07:35:26.310] [646/2212] Compiling C object
src/backend/postgres_lib.a.p/commands_matview.c.obj
[07:35:26.310] c:\cirrus\src\include\access\tableam.h(1522) : warning
C4715: 'table_insert_begin': not all control paths return a value

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

02 March 2024, 06:32:29

On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > Please find the attached v9 patch set.

I've had to rebase the patches due to commit 874d817, please find the
attached v11 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

08 March 2024, 10:36:59

On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > > Please find the attached v9 patch set.
>
> I've had to rebase the patches due to commit 874d817, please find the
> attached v11 patch set.

Rebase needed. Please see the v12 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Masahiko Sawada

Date:

19 March 2024, 05:09:36

Hi,

On Fri, Mar 8, 2024 at 7:37 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > > Please find the attached v9 patch set.
> >
> > I've had to rebase the patches due to commit 874d817, please find the
> > attached v11 patch set.
>
> Rebase needed. Please see the v12 patch set.
>

I've not reviewed the patches in depth yet, but run performance tests
for CREATE MATERIALIZED VIEW. The test scenarios is:

-- setup
create unlogged table test (c int);
insert into test select generate_series(1, 10000000);

-- run
create materialized view test_mv as select * from test;

Here are the results:

* HEAD
3775.221 ms
3744.039 ms
3723.228 ms

* v12 patch
6289.972 ms
5880.674 ms
7663.509 ms

I can see performance regressions and the perf report says that CPU
spent most time on extending the ResourceOwner's array while copying
the buffer-heap tuple:

- 52.26% 0.18% postgres postgres [.] intorel_receive
    52.08% intorel_receive
        table_multi_insert_v2 (inlined)
        - heap_multi_insert_v2
            - 51.53% ExecCopySlot (inlined)
                tts_buffer_heap_copyslot
                tts_buffer_heap_store_tuple (inlined)
             - IncrBufferRefCount
                 - ResourceOwnerEnlarge
                     ResourceOwnerAddToHash (inlined)

Is there any reason why we copy a buffer-heap tuple to another
buffer-heap tuple? Which results in that we increments the buffer
refcount and register it to ResourceOwner for every tuples. I guess
that the destination tuple slot is not necessarily a buffer-heap, and
we could use VirtualTupleTableSlot instead. It would in turn require
copying a heap tuple. I might be missing something but it improved the
performance at least in my env. The change I made was:

-       dstslot = table_slot_create(state->rel, NULL);
+       //dstslot = table_slot_create(state->rel, NULL);
+       dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
+                                    &TTSOpsVirtual);
+

And the execution times are:
1588.984 ms
1591.618 ms
1582.519 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

21 March 2024, 04:14:16

On Tue, Mar 19, 2024 at 10:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've not reviewed the patches in depth yet, but run performance tests
> for CREATE MATERIALIZED VIEW. The test scenarios is:

Thanks for looking into this.

> Is there any reason why we copy a buffer-heap tuple to another
> buffer-heap tuple? Which results in that we increments the buffer
> refcount and register it to ResourceOwner for every tuples. I guess
> that the destination tuple slot is not necessarily a buffer-heap, and
> we could use VirtualTupleTableSlot instead. It would in turn require
> copying a heap tuple. I might be missing something but it improved the
> performance at least in my env. The change I made was:
>
> -       dstslot = table_slot_create(state->rel, NULL);
> +       //dstslot = table_slot_create(state->rel, NULL);
> +       dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
> +                                    &TTSOpsVirtual);
> +
>
> And the execution times are:
> 1588.984 ms
> 1591.618 ms
> 1582.519 ms

Yes, usingVirtualTupleTableSlot helps improve the performance a lot.
Below are results from my testing. Note that CMV, RMV, CTAS stand for
CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, CREATE TABLE AS
respectively. These commands got faster by 62.54%, 68.87%, 74.31% or
2.67, 3.21, 3.89 times respectively. I've used the test case specified
at [1].

HEAD:
CMV:
Time: 6276.468 ms (00:06.276)
CTAS:
Time: 8141.632 ms (00:08.142)
RMV:
Time: 14747.139 ms (00:14.747)

PATCHED:
CMV:
Time: 2350.282 ms (00:02.350)
CTAS:
Time: 2091.427 ms (00:02.091)
RMV:
Time: 4590.180 ms (00:04.590)

I quickly looked at the description of what a "virtual" tuple is from
src/include/executor/tuptable.h [2]. IIUC, it is invented for
minimizing data copying, but it also says that it's the responsibility
of the generating plan node to be sure these resources are not
released for as long as the virtual tuple needs to be valid or is
materialized. While it says this, as far as this patch is concerned,
the virtual slot gets materialized when we copy the tuples from source
slot (can be any type of slot) to destination slot (which is virtual
slot). See ExecCopySlot->
tts_virtual_copyslot->tts_virtual_materialize. This way,
tts_virtual_copyslot ensures the tuples storage doesn't depend on
external memory because all the datums that aren't passed by value are
copied into the slot's memory context.

With the above understanding, it looks safe to use virtual slots for
the multi insert buffered slots. I'm not so sure if I'm missing
anything here.

[1]
cd $PWD/pg17/bin
rm -rf data logfile
./initdb -D data
./pg_ctl -D data -l logfile start

./psql -d postgres
\timing
drop table test cascade;
create unlogged table test (c int);
insert into test select generate_series(1, 10000000);
create materialized view test_mv as select * from test;
create table test_copy as select * from test;
insert into test select generate_series(1, 10000000);
refresh materialized view test_mv;

[2]
 * A "virtual" tuple is an optimization used to minimize physical data copying
 * in a nest of plan nodes.  Until materialized pass-by-reference Datums in
 * the slot point to storage that is not directly associated with the
 * TupleTableSlot; generally they will point to part of a tuple stored in a
 * lower plan node's output TupleTableSlot, or to a function result
 * constructed in a plan node's per-tuple econtext.  It is the responsibility
 * of the generating plan node to be sure these resources are not released for
 * as long as the virtual tuple needs to be valid or is materialized.  Note
 * also that a virtual tuple does not have any "system columns".

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

21 March 2024, 07:40:16

On Thu, Mar 21, 2024 at 9:44 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Yes, usingVirtualTupleTableSlot helps improve the performance a lot.
> Below are results from my testing. Note that CMV, RMV, CTAS stand for
> CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, CREATE TABLE AS
> respectively. These commands got faster by 62.54%, 68.87%, 74.31% or
> 2.67, 3.21, 3.89 times respectively. I've used the test case specified
> at [1].
>
> HEAD:
> CMV:
> Time: 6276.468 ms (00:06.276)
> CTAS:
> Time: 8141.632 ms (00:08.142)
> RMV:
> Time: 14747.139 ms (00:14.747)
>
> PATCHED:
> CMV:
> Time: 2350.282 ms (00:02.350)
> CTAS:
> Time: 2091.427 ms (00:02.091)
> RMV:
> Time: 4590.180 ms (00:04.590)
>
> I quickly looked at the description of what a "virtual" tuple is from
> src/include/executor/tuptable.h [2]. IIUC, it is invented for
> minimizing data copying, but it also says that it's the responsibility
> of the generating plan node to be sure these resources are not
> released for as long as the virtual tuple needs to be valid or is
> materialized. While it says this, as far as this patch is concerned,
> the virtual slot gets materialized when we copy the tuples from source
> slot (can be any type of slot) to destination slot (which is virtual
> slot). See ExecCopySlot->
> tts_virtual_copyslot->tts_virtual_materialize. This way,
> tts_virtual_copyslot ensures the tuples storage doesn't depend on
> external memory because all the datums that aren't passed by value are
> copied into the slot's memory context.
>
> With the above understanding, it looks safe to use virtual slots for
> the multi insert buffered slots. I'm not so sure if I'm missing
> anything here.

I'm attaching the v13 patches using virtual tuple slots for buffered
tuples for multi inserts.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

23 March 2024, 00:17:12

On Thu, 2024-03-21 at 13:10 +0530, Bharath Rupireddy wrote:
> I'm attaching the v13 patches using virtual tuple slots for buffered
> tuples for multi inserts.

Comments:

* Do I understand correctly that CMV, RMV, and CTAS experience a
performance benefit, but COPY FROM does not? And is that because COPY
already used table_multi_insert, whereas CMV and RMV did not?

* In the COPY FROM code, it looks like it's deciding whether to flush
based on MAX_BUFFERED_TUPLES, but the slot array is allocated with
MAX_BUFFERED_SLOTS (they happen to be the same for heap, but perhaps
not for other AMs). The copy code shouldn't be using internal knowledge
of the multi-insert code; it should know somehow from the API when the
right time is to flush.

* How is the memory management expected to work? It looks like COPY
FROM is using the ExprContext when running the input functions, but we
really want to be using a memory context owned by the table AM, right?

* What's the point of the table_multi_insert_slots() "num_slots"
argument? The only caller simply discards it.

* table_tuple_insert_v2 isn't called anywhere, what's it for?

* the "v2" naming is inconsistent -- it seems you only added it in
places where there's a name conflict, which makes it hard to tell which
API methods go together. I'm not sure how widely table_multi_insert* is
used outside of core, so it's possible that we may even be able to just
change those APIs and the few extensions that call it can be updated.

* Memory usage tracking should be done in the AM by allocating
everything in a single context so it's easy to check the size. Don't
manually add up memory.

* I don't understand: "Caller may have got the slot using
heap_multi_insert_next_free_slot, filled it and passed. So, skip
copying in such a case." If the COPY FROM had a WHERE clause and
skipped a tuple after filling the slot, doesn't that mean the slot has
bogus data from the last tuple?

* We'd like this to work for insert-into-select (IIS) and logical
replication, too. Do you see any problem there, or is it just a matter
of code?

* Andres had some comments[1] that don't seem entirely addressed.
  - You are still allocating the AM-specific part of TableModifyState
as a separately-allocated chunk.
  - It's still called TableInsertState rather than TableModifyState as
he suggested. If you change that, you should also change to
table_modify_begin/end.
  - CID: I suppose Andres is considering the use case of "BEGIN; ...
ten thousand inserts ... COMMIT;". I don't think this problem is really
solvable (discussed below) but we should have some response/consensus
on that point.
  - He mentioned that we only need one new method implemented by the
AM. I don't know if one is enough, but 7 does seem excessive. I have
some simplification ideas below.

Overall:

If I understand correctly, there are two ways to use the API:

1. used by CTAS, MV:

  tistate = table_insert_begin(...);
  table_multi_insert_v2(tistate, tup1);
  ...
  table_multi_insert_v2(tistate, tupN);
  table_insert_end(tistate);

2. used by COPY ... FROM:

  tistate = table_insert_begin(..., SKIP_FLUSH);
  if (multi_insert_slot_array_is_full())
     table_multi_insert_flush(tistate);
  slot = table_insert_next_free_slot(tistate);
  ... fill slot with tup1
  table_multi_insert_v2(tistate, tup1);
  ...
  slot = table_insert_next_free_slot(tistate);
  ... fill slot with tupN
  table_multi_insert_v2(tistate, tupN);
  table_insert_end(tistate);

Those two uses need comments explaining what's going on. It appears the
SKIP_FLUSH flag is used to indicate which use the caller intends.

Use #2 is not enforced well by either the API or runtime checks. If the
caller neglects to check for a full buffer, it appears that it will
just overrun the slots array.

Also, for use #2, table_multi_insert_v2() doesn't do much other than
incrementing the memory used. The slot will never be NULL because it
was obtained with table_multi_insert_next_free_slot(), and the other
two branches don't happen when SKIP_FLUSH is true.

The real benefit to COPY of your new API is that the AM can manage
slots for itself, and how many tuples may be tracked (which might be a
lot higher for non-heap AMs).

I agree with Luc Vlaming's comment[2] that more should be left to the
table AM. Your patch tries too hard to work with the copyfrom.c slot
array, somehow sharing it with the table AM. That adds complexity to
the API and feels like a layering violation.

We also shouldn't mandate a slot array in the API. Each slot is 64
bytes -- a lot of overhead for small tuples. For a non-heap AM, it's
much better to store the tuple data in a big contiguous chunk with
minimal overhead.

Let's just have a simple API like:

  tmstate = table_modify_begin(...);
  table_modify_save_insert(tmstate, tup1);
  ...
  table_modify_save_insert(tmstate, tupN);
  table_modify_end(tmstate);

and leave it up to the AM to do all the buffering and flushing work (as
Luc Vlaming suggested[2]).

That leaves one problem, which is: how do we update the indexes and
call AR triggers while flushing? I think the best way is to just have a
callback in the TableModifyState that is called during flush. (I don't
think that would affect performance, but worth double-checking.)

We have to disable this whole multi-insert mechanism if there are
volatile BR/AR triggers, because those are supposed to see already-
inserted tuples. That's not a problem with your patch but it is a bit
unfortunate -- triggers can be costly already, but this increases the
penalty. There may be some theoretical ways to avoid this problem, like
reading tuples out of the unflushed buffer during a SELECT, which
sounds a little too clever (though perhaps not completely crazy if the
AM is in control of both?).

For potentially working with multi-updates/deletes, it might be as
simple as tracking the old TIDs along with the slots and having new
_save_update and _save_delete methods. I haven't thought deeply about
that, and I'm not sure we have a good example AM to work with, but it
seems plausible that we could make something useful here.

To batch multiple different INSERT statements within a transaction just
seems like a really hard problem. That could mean different CIDs, but
also different subtransaction IDs. Constraint violation errors will
happen at the time of flushing, which could be many commands later from
the one that actually violates the constraint. And what if someone
issues a SELECT in the middle of the transaction, how does it see the
already-inserted-but-not-flushed tuples? If that's not hard enough
already, then you would also need to extend low-level APIs to accept
arbitrary CIDs and subxact IDs when storing tuples during a flush. The
only way I could imagine solving all of these problems is declaring
somehow that your transaction won't do any of these complicated things,
and that you don't mind getting constraint violations at the wrong
time. So I recommend that you punt on this problem.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/20230603223824.o7iyochli2dwwi7k%40alap3.anarazel.de
[2]
https://www.postgresql.org/message-id/508af801-6356-d36b-1867-011ac6df8f55%40swarm64.com

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

25 March 2024, 19:58:00

On Sat, Mar 23, 2024 at 5:47 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Comments:

Thanks for looking into it.

> * Do I understand correctly that CMV, RMV, and CTAS experience a
> performance benefit, but COPY FROM does not? And is that because COPY
> already used table_multi_insert, whereas CMV and RMV did not?

Yes, that's right. COPY FROM is already optimized with multi inserts.

I now have a feeling that I need to simplify the patches. I'm thinking
of dropping the COPY FROM patch using the new multi insert API for the
following reasons:
1. We can now remove some of the new APIs (table_multi_insert_slots
and table_multi_insert_next_free_slot) that were just invented for
COPY FROM.
2. COPY FROM is already optimized with multi inserts, so no real gain
is expected with the new multi insert API.
3. As we are inching towards feature freeze, simplifying the patches
by having only the necessary things increases the probability of
getting this in.
4. The real benefit of this whole new multi insert API is seen if used
for the commands CMV, RMV, CTAS. These commands got faster by 62.54%,
68.87%, 74.31% or 2.67, 3.21, 3.89 times respectively.
5. This leaves with really simple APIs. No need for callback stuff for
dealing with indexes, triggers etc. as CMV, RMV, CTAS cannot have any
of them.

The new APIs are more extensible, memory management is taken care of
by AM, and with TableModifyState as the structure name and more
meaningful API names. The callback for triggers/indexes etc. aren't
taken care of as I'm now only focusing on CTAS, CMV, RMV
optimizations.

Please see the attached v14 patches.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

26 March 2024, 15:37:04

On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote:
> I'm thinking
> of dropping the COPY FROM patch using the new multi insert API for
> the
> following reasons: ...

I agree with all of this. We do want COPY ... FROM support, but there
are some details to work out and we don't want to make a big code
change at this point in the cycle.

> The new APIs are more extensible, memory management is taken care of
> by AM, and with TableModifyState as the structure name and more
> meaningful API names. The callback for triggers/indexes etc. aren't
> taken care of as I'm now only focusing on CTAS, CMV, RMV
> optimizations.
>
> Please see the attached v14 patches.

* No need for a 'kind' field in TableModifyState. The state should be
aware of the kinds of changes that it has received and that may need to
be flushed later -- for now, only inserts, but possibly updates/deletes
in the future.

* If the AM doesn't support the bulk methods, fall back to retail
inserts instead of throwing an error.

* It seems like this API will eventually replace table_multi_insert and
table_finish_bulk_insert completely. Do those APIs have any advantage
remaining over the new one proposed here?

* Right now I don't any important use of the flush method. It seems
that could be accomplished in the finish method, and flush could just
be an internal detail when the memory is exhausted. If we find a use
for it later, we can always add it, but right now it seems unnecessary.

* We need to be careful about cases where the command can be successful
but the writes are not flushed. I don't tihnk that's a problem with the
current patch, but we will need to do something here when we expand to
INSERT INTO ... SELECT.

Andres, is this patch overall closer to what you had in mind in the
email here:

https://www.postgresql.org/message-id/20230603223824.o7iyochli2dwwi7k@alap3.anarazel.de

?

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

26 March 2024, 19:49:51

On Tue, Mar 26, 2024 at 9:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote:
> > I'm thinking
> > of dropping the COPY FROM patch using the new multi insert API for
> > the
> > following reasons: ...
>
> I agree with all of this. We do want COPY ... FROM support, but there
> are some details to work out and we don't want to make a big code
> change at this point in the cycle.

Right.

> > Please see the attached v14 patches.
>
> * No need for a 'kind' field in TableModifyState. The state should be
> aware of the kinds of changes that it has received and that may need to
> be flushed later -- for now, only inserts, but possibly updates/deletes
> in the future.

Removed 'kind' field with lazy initialization of required AM specific
modify (insert in this case) state. Since we don't have 'kind', I
chose the callback approach to cleanup the modify (insert in this
case) specific state at the end.

> * If the AM doesn't support the bulk methods, fall back to retail
> inserts instead of throwing an error.

For instance, CREATE MATERIALIZED VIEW foo_mv AS SELECT * FROM foo
USING bar_tam; doesn't work if bar_tam doesn't have the
table_tuple_insert implemented.

Similarly, with this new AM, the onus lies on the table AM
implementers to provide an implementation for these new AMs even if
they just do single inserts. But, I do agree that we must catch this
ahead during parse analysis itself, so I've added assertions in
GetTableAmRoutine().

> * It seems like this API will eventually replace table_multi_insert and
> table_finish_bulk_insert completely. Do those APIs have any advantage
> remaining over the new one proposed here?

table_multi_insert needs to be there for sure as COPY ... FROM uses
it. Not sure if we need to remove the optional callback
table_finish_bulk_insert though. Heap AM doesn't implement one, but
some other AM might. Having said that, with this new AM, whatever the
logic that used to be there in table_finish_bulk_insert previously,
table AM implementers will have to move them to table_modify_end.

FWIW, I can try writing a test table AM that uses this new AM but just
does single inserts, IOW, equivalent to table_tuple_insert().
Thoughts?

> * Right now I don't any important use of the flush method. It seems
> that could be accomplished in the finish method, and flush could just
> be an internal detail when the memory is exhausted. If we find a use
> for it later, we can always add it, but right now it seems unnecessary.

Firstly, we are not storing CommandId and options in TableModifyState,
because we expect CommandId to be changing (per Andres comment).
Secondly, we don't want to pass just the CommandId and options to
table_modify_end(). Thirdly, one just has to call the
table_modify_buffer_flush before the table_modify_end. Do you have any
other thoughts here?

> * We need to be careful about cases where the command can be successful
> but the writes are not flushed. I don't tihnk that's a problem with the
> current patch, but we will need to do something here when we expand to
> INSERT INTO ... SELECT.

You mean, writes are not flushed to the disk? Can you please elaborate
why it's different for INSERT INTO ... SELECT and not others? Can't
the new flush AM be helpful here to implement any flush related
things?

Please find the attached v15 patches with the above review comments addressed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

27 March 2024, 08:12:19

On Wed, 2024-03-27 at 01:19 +0530, Bharath Rupireddy wrote:
>
> Similarly, with this new AM, the onus lies on the table AM
> implementers to provide an implementation for these new AMs even if
> they just do single inserts.

Why not fall back to using the plain tuple_insert? Surely some table
AMs might be simple and limited, and we shouldn't break them just
because they don't implement the new APIs.

>
> table_multi_insert needs to be there for sure as COPY ... FROM uses
> it.

After we have these new APIs fully in place and used by COPY, what will
happen to those other APIs? Will they be deprecated or will there be a
reason to keep them?

> FWIW, I can try writing a test table AM that uses this new AM but
> just
> does single inserts, IOW, equivalent to table_tuple_insert().
> Thoughts?

More table AMs to test against would be great, but I also know that can
be a lot of work.

>
> Firstly, we are not storing CommandId and options in
> TableModifyState,
> because we expect CommandId to be changing (per Andres comment).

Trying to make this feature work across multiple commands poses a lot
of challenges: what happens when there are SELECTs and subtransactions
and non-deferrable constraints?

Regardless, if we care about multiple CIDs, they should be stored along
with the tuples, not supplied at the time of flushing.

> You mean, writes are not flushed to the disk? Can you please
> elaborate
> why it's different for INSERT INTO ... SELECT and not others? Can't
> the new flush AM be helpful here to implement any flush related
> things?

Not a major problem. We can discuss while working on IIS support.

I am concnerned that the flush callback is not a part of the API. We
will clearly need that to support index insertions for COPY/IIS, so as-
is the API feels incomplete. Thoughts?

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

31 March 2024, 15:48:14

On Wed, Mar 27, 2024 at 1:42 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2024-03-27 at 01:19 +0530, Bharath Rupireddy wrote:
> >
> > Similarly, with this new AM, the onus lies on the table AM
> > implementers to provide an implementation for these new AMs even if
> > they just do single inserts.
>
> Why not fall back to using the plain tuple_insert? Surely some table
> AMs might be simple and limited, and we shouldn't break them just
> because they don't implement the new APIs.

Hm. That might complicate table_modify_begin,
table_modify_buffer_insert and table_modify_end a bit. What do we put
in TableModifyState then? Do we create the bulk insert state
(BulkInsertStateData) outside? I think to give a better interface, can
we let TAM implementers support these new APIs in their own way? If
this sounds rather intrusive, we can just implement the fallback to
tuple_insert if these new API are not supported in the caller, for
example, do something like below in createas.c and matview.c.
Thoughts?

if (table_modify_buffer_insert() is defined)
   table_modify_buffer_insert(...);
else
{
  myState->bistate = GetBulkInsertState();
  table_tuple_insert(...);
}

> > table_multi_insert needs to be there for sure as COPY ... FROM uses
> > it.
>
> After we have these new APIs fully in place and used by COPY, what will
> happen to those other APIs? Will they be deprecated or will there be a
> reason to keep them?

Deprecated perhaps?

Please find the attached v16 patches for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Jeff Davis

Date:

02 April 2024, 19:40:36

On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote:
> if (table_modify_buffer_insert() is defined)
>    table_modify_buffer_insert(...);
> else
> {
>   myState->bistate = GetBulkInsertState();
>   table_tuple_insert(...);
> }

We can't alloc/free the bulk insert state for every insert call. I see
two options:

* Each caller needs to support two code paths: if the buffered insert
APIs are defined, then use those; otherwise the caller needs to manage
the bulk insert state itself and call the plain insert API.

* Have default implementation for the new API methods, so that the
default for the begin method would allocate the bulk insert state, and
the default for the buffered insert method would be to call plain
insert using the bulk insert state.

I'd prefer the latter, at least in the long term. But I haven't really
thought through the details, so perhaps we'd need to use the former.

> >
> > After we have these new APIs fully in place and used by COPY, what
> > will
> > happen to those other APIs? Will they be deprecated or will there
> > be a
> > reason to keep them?
>
> Deprecated perhaps?

Including Alexander on this thread, because he's making changes to the
multi-insert API. We need some consensus on where we are going with
these APIs before we make more changes, and what incremental steps make
sense in v17.

Here's where I think this API should go:

1. Have table_modify_begin/end and table_modify_buffer_insert, like
those that are implemented in your patch.

2. Add some kind of flush callback that will be called either while the
tuples are being flushed or after the tuples are flushed (but before
they are freed by the AM). (Aside: do we need to call it while the
tuples are being flushed to get the right visibility semantics for
after-row triggers?)

3. Add table_modify_buffer_{update|delete} APIs.

4. Some kind of API tweaks to help manage memory when modifying
pertitioned tables, so that the buffering doesn't get out of control.
Perhaps just reporting memory usage and allowing the caller to force
flushes would be enough.

5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
fairly straightforward, I believe, and handled by your patch. Indexes
are (re)built afterward, and no triggers are possible.

6. Use these new methods for CREATE TABLE ... AS. This is fairly
straightforward, I believe, and handled by your patch. No indexes or
triggers are possible.

7. Use these new methods for COPY. We have to be careful to avoid
regressions for the heap method, because it's already managing its own
buffers. If the AM manages the buffering, then it may require
additional copying of slots, which could be a disadvantage. To solve
this, we may need some minor API tweaks to avoid copying when the
caller guarantees that the memory will not be freed to early, or
perhaps expose the AM's memory context to copyfrom.c. Another thing to
consider is that the buffering in copyfrom.c is also used for FDWs, so
that buffering code path needs to be preserved in copyfrom.c even if
not used for AMs.

8. Use these new methods for INSERT INTO ... SELECT. One potential
challenge here is that execution nodes are not always run to
completion, so we need to be sure that the flush isn't forgotten in
that case.

9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
the buffer_insert/update/delete APIs; we don't need a separate merge
method. This probably requires that the AM maintain 3 separate buffers
to distinguish different kinds of changes at flush time (obviously
these can be initialized lazily to avoid overhead when not being used).

10. Use these new methods for logical apply.

11. Deprecate the multi_insert API.

Thoughts on this plan? Does your patch make sense in v17 as a stepping
stone, or should we try to make all of these API changes together in
v18?

Also, a sample AM code would be a huge benefit here. Writing a real AM
is hard, but perhaps we can at least have an example one to demonstrate
how to use these APIs?

Regards,
    Jeff Davis

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

03 April 2024, 09:02:43

On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote:
> > if (table_modify_buffer_insert() is defined)
> >    table_modify_buffer_insert(...);
> > else
> > {
> >   myState->bistate = GetBulkInsertState();
> >   table_tuple_insert(...);
> > }
>
> We can't alloc/free the bulk insert state for every insert call. I see
> two options:
>
> * Each caller needs to support two code paths: if the buffered insert
> APIs are defined, then use those; otherwise the caller needs to manage
> the bulk insert state itself and call the plain insert API.
>
> * Have default implementation for the new API methods, so that the
> default for the begin method would allocate the bulk insert state, and
> the default for the buffered insert method would be to call plain
> insert using the bulk insert state.
>
> I'd prefer the latter, at least in the long term. But I haven't really
> thought through the details, so perhaps we'd need to use the former.

I too prefer the latter so that the caller doesn't have to have two
paths. The new API can just transparently fallback to single inserts.
I've implemented that in the attached v17 patch. I also tested the
default APIs manually, but I'll see if I can add some tests to it the
default API.

> > > After we have these new APIs fully in place and used by COPY, what
> > > will
> > > happen to those other APIs? Will they be deprecated or will there
> > > be a
> > > reason to keep them?
> >
> > Deprecated perhaps?
>
> Including Alexander on this thread, because he's making changes to the
> multi-insert API. We need some consensus on where we are going with
> these APIs before we make more changes, and what incremental steps make
> sense in v17.
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.
>
> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)
>
> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 4. Some kind of API tweaks to help manage memory when modifying
> pertitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.
>
> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.
>
> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.
>
> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).
>
> 10. Use these new methods for logical apply.
>
> 11. Deprecate the multi_insert API.
>
> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?

I'd like to see the new multi insert API (as proposed in the v17
patches) for PG17 if possible. The basic idea with these new APIs is
to let the AM implementers choose the right buffered insert strategy
(one can choose the AM specific slot type to buffer the tuples, choose
the AM specific memory and flushing decisions etc.). Another advantage
with these new multi insert API is that the CREATE MATERIALIZED VIEW,
REFRESH MATERIALIZED VIEW, CREATE TABLE AS commands for heap AM got
faster by 62.54%, 68.87%, 74.31% or 2.67, 3.21, 3.89 times
respectively. The performance improvement in REFRESH MATERIALIZED VIEW
can benefit customers running analytical workloads on postgres.

I'm fine if we gradually add more infrastructure to support COPY,
INSERT INTO SELECT, Logical Replication Apply, Table Rewrites in
future releases. I'm sure it requires a lot more thoughts and time.

> Also, a sample AM code would be a huge benefit here. Writing a real AM
> is hard, but perhaps we can at least have an example one to demonstrate
> how to use these APIs?

The heap AM implements this new API. Also, there's a default
implementation for the new API falling back on to single inserts.
Aren't these sufficient to help AM implementers to come up with their
own implementations?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: New Table Access Methods for Multi and Single Inserts

From

Bharath Rupireddy

Date:

03 April 2024, 12:25:27

On Wed, Apr 3, 2024 at 2:32 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I too prefer the latter so that the caller doesn't have to have two
> paths. The new API can just transparently fallback to single inserts.
> I've implemented that in the attached v17 patch. I also tested the
> default APIs manually, but I'll see if I can add some tests to it the
> default API.

Fixed a compiler warning found via CF bot. Please find the attached
v18 patches. I'm sorry for the noise.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Bharath Rupireddy

Date:

24 April 2024, 12:49:40

On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.

I added table_modify_begin, table_modify_buffer_insert, table_modify_buffer_flush and table_modify_end. Table Access Method (AM) authors now can define their own buffering strategy and flushing decisions based on their tuple storage kinds and various other AM specific factors. I also added a default implementation that falls back to single inserts when no implementation is provided for these AM by AM authors. See the attached v19-0001 patch.

> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)

I added a flush callback named TableModifyBufferFlushCallback; when provided by callers invoked after tuples are flushed to disk from the buffers but before the AM frees them up. Index insertions and AFTER ROW INSERT triggers can be executed in this callback. See the v19-0001 patch for how AM invokes the flush callback, and see either v19-0003 or v19-0004 or v19-0005 for how a caller can supply the callback and required context to execute index insertions and AR triggers.

> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).

I haven't thought about these things yet. I can only focus on them after seeing how the attached patches go from here.

> 4. Some kind of API tweaks to help manage memory when modifying
> pertitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.

Heap implementation for thes new Table AMs uses a separate memory context for all of the operations. Please have a look and let me know if we need anything more.

> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.

I used multi inserts for all of these including TABLE REWRITE commands such as ALTER TABLE. See the attached v19-0002 patch. Check the testing section below for benefits.

FWIW, following are some of the TABLE REWRITE commands that can get benefitted:

ALTER TABLE tbl ALTER c1 TYPE bigint;
ALTER TABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ALTER TABLE itest3 ALTER COLUMN a TYPE int;
ALTER TABLE gtest20 ALTER COLUMN b SET EXPRESSION AS (a * 3);
ALTER TABLE has_volatile ADD col4 int DEFAULT (random() * 10000)::int;
and so on.

> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.

I modified the COPY FROM code to use the new Table AMs, and performed some tests which show no signs of regression. Check the testing section below for more details. See the attached v19-0005 patch. With this, table_multi_insert can be deprecated.

> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.

I did that in v19-0003. I did place the table_modify_end call in multiple places including ExecEndModifyTable. I didn't find any issues with it. Please have a look and let me know if we need the end call in more places. Check the testing section below for benefits.

> 10. Use these new methods for logical apply.

I used multi inserts for Logical Replication apply. in v19-0004. Check the testing section below for benefits.

FWIW, open-source pglogical does have multi insert support, check code around https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L960.

> 11. Deprecate the multi_insert API.

I did remove both table_multi_insert and table_finish_bulk_insert in v19-0006. Perhaps, removing them isn't a great idea, but adding a deprecation WARNING/ERROR until some more PG releases might be worth looking at.

> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?

If the design, code and benefits that these new Table AMs bring to the table look good, I hope to see it for PG 18.

> Also, a sample AM code would be a huge benefit here. Writing a real AM
> is hard, but perhaps we can at least have an example one to demonstrate
> how to use these APIs?

The attached patches already have implemented these new Table AMs for Heap. I don't think we need a separate implementation to demonstrate. If others feel so, I'm open to thoughts here.

Having said above, I'd like to reiterate the motivation behind the new Table AMs for multi and single inserts.

1. A scan-like API with state being carried across is thought to be better as suggested by Andres Freund - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de.
2. Allowing a Table AM to optimize operations across multiple inserts, define its own buffering strategy and take its own flushing decisions based on their tuple storage kinds and various other AM specific factors.
3. Improve performance of various SQL commands with multi inserts for Heap AM.

The attached v19 patches might need some more detailed comments, some documentation and some specific tests ensuring the multi inserts for Heap are kicked-in for various commands. I'm open to thoughts here.

I did some testing to see how various commands benefit with multi inserts using these new Table AM for heap. It's not only the improvement in performance these commands see, but also the amount of WAL that gets generated reduces greatly. After all, multi inserts optimize the insertions by writing less WAL. IOW, writing WAL record per page if multiple rows fit into a single data page as opposed to WAL record per row.

Test case 1: 100 million rows, 2 columns (int and float)

Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 121 | 77 | 36.3 | 1.57
CREATE MATERIALIZED VIEW | 101 | 49 | 51.4 | 2.06
REFRESH MATERIALIZED VIEW | 113 | 54 | 52.2 | 2.09
ALTER TABLE (TABLE REWRITE) | 124 | 81 | 34.6 | 1.53
COPY FROM | 71 | 72 | 0 | 1
INSERT INTO ... SELECT | 117 | 62 | 47 | 1.88
LOGICAL REPLICATION APPLY | 393 | 306 | 22.1 | 1.28

Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 6.8 | 2.4 | 64.7 | 2.83
CREATE MATERIALIZED VIEW | 7.2 | 2.3 | 68 | 3.13
REFRESH MATERIALIZED VIEW | 10 | 5.1 | 49 | 1.96
ALTER TABLE (TABLE REWRITE) | 8 | 3.2 | 60 | 2.5
COPY FROM | 2.9 | 3 | 0 | 1
INSERT INTO ... SELECT | 8 | 3 | 62.5 | 2.66
LOGICAL REPLICATION APPLY | 7.5 | 2.3 | 69.3 | 3.26

Test case 2: 1 billion rows, 1 column (int)

Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 794 | 386 | 51.38 | 2.05
CREATE MATERIALIZED VIEW | 1006 | 563 | 44.03 | 1.78
REFRESH MATERIALIZED VIEW | 977 | 603 | 38.28 | 1.62
ALTER TABLE (TABLE REWRITE) | 1189 | 714 | 39.94 | 1.66
COPY FROM | 321 | 330 | -0.02 | 0.97
INSERT INTO ... SELECT | 1084 | 586 | 45.94 | 1.84
LOGICAL REPLICATION APPLY | 3530 | 2982 | 15.52 | 1.18

Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 60 | 12 | 80 | 5
CREATE MATERIALIZED VIEW | 60 | 12 | 80 | 5
REFRESH MATERIALIZED VIEW | 60 | 12 | 80 | 5
ALTER TABLE (TABLE REWRITE) | 123 | 31 | 60 | 2.5
COPY FROM | 12 | 12 | 0 | 1
INSERT INTO ... SELECT | 120 | 24 | 80 | 5
LOGICAL REPLICATION APPLY | 61 | 12 | 80.32 | 5

Test setup:
./configure --prefix=$PWD/pg17/ --enable-tap-tests CFLAGS="-ggdb3 -O2" > install.log && make -j 8 install > install.log 2>&1 &

wal_level=logical
max_wal_size = 256GB
checkpoint_timeout = 1h

Test system is EC2 instance of type c5.4xlarge:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 7
BogoMIPS: 5999.99
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15

RAM:

MemTotal: 32036536 kB

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Pavel Stehule

Date:

24 April 2024, 16:07:03

st 24. 4. 2024 v 14:50 odesílatel Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> napsal:

On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.

I added table_modify_begin, table_modify_buffer_insert, table_modify_buffer_flush and table_modify_end. Table Access Method (AM) authors now can define their own buffering strategy and flushing decisions based on their tuple storage kinds and various other AM specific factors. I also added a default implementation that falls back to single inserts when no implementation is provided for these AM by AM authors. See the attached v19-0001 patch.

> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)

I added a flush callback named TableModifyBufferFlushCallback; when provided by callers invoked after tuples are flushed to disk from the buffers but before the AM frees them up. Index insertions and AFTER ROW INSERT triggers can be executed in this callback. See the v19-0001 patch for how AM invokes the flush callback, and see either v19-0003 or v19-0004 or v19-0005 for how a caller can supply the callback and required context to execute index insertions and AR triggers.

> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).

I haven't thought about these things yet. I can only focus on them after seeing how the attached patches go from here.

> 4. Some kind of API tweaks to help manage memory when modifying
> pertitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.

Heap implementation for thes new Table AMs uses a separate memory context for all of the operations. Please have a look and let me know if we need anything more.

> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.

I used multi inserts for all of these including TABLE REWRITE commands such as ALTER TABLE. See the attached v19-0002 patch. Check the testing section below for benefits.

FWIW, following are some of the TABLE REWRITE commands that can get benefitted:

ALTER TABLE tbl ALTER c1 TYPE bigint;
ALTER TABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ALTER TABLE itest3 ALTER COLUMN a TYPE int;
ALTER TABLE gtest20 ALTER COLUMN b SET EXPRESSION AS (a * 3);
ALTER TABLE has_volatile ADD col4 int DEFAULT (random() * 10000)::int;
and so on.

> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.

I modified the COPY FROM code to use the new Table AMs, and performed some tests which show no signs of regression. Check the testing section below for more details. See the attached v19-0005 patch. With this, table_multi_insert can be deprecated.

> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.

I did that in v19-0003. I did place the table_modify_end call in multiple places including ExecEndModifyTable. I didn't find any issues with it. Please have a look and let me know if we need the end call in more places. Check the testing section below for benefits.

> 10. Use these new methods for logical apply.

I used multi inserts for Logical Replication apply. in v19-0004. Check the testing section below for benefits.

FWIW, open-source pglogical does have multi insert support, check code around https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L960.

> 11. Deprecate the multi_insert API.

I did remove both table_multi_insert and table_finish_bulk_insert in v19-0006. Perhaps, removing them isn't a great idea, but adding a deprecation WARNING/ERROR until some more PG releases might be worth looking at.

> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?

If the design, code and benefits that these new Table AMs bring to the table look good, I hope to see it for PG 18.

> Also, a sample AM code would be a huge benefit here. Writing a real AM
> is hard, but perhaps we can at least have an example one to demonstrate
> how to use these APIs?

The attached patches already have implemented these new Table AMs for Heap. I don't think we need a separate implementation to demonstrate. If others feel so, I'm open to thoughts here.

Having said above, I'd like to reiterate the motivation behind the new Table AMs for multi and single inserts.

1. A scan-like API with state being carried across is thought to be better as suggested by Andres Freund - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de.
2. Allowing a Table AM to optimize operations across multiple inserts, define its own buffering strategy and take its own flushing decisions based on their tuple storage kinds and various other AM specific factors.
3. Improve performance of various SQL commands with multi inserts for Heap AM.

The attached v19 patches might need some more detailed comments, some documentation and some specific tests ensuring the multi inserts for Heap are kicked-in for various commands. I'm open to thoughts here.

I did some testing to see how various commands benefit with multi inserts using these new Table AM for heap. It's not only the improvement in performance these commands see, but also the amount of WAL that gets generated reduces greatly. After all, multi inserts optimize the insertions by writing less WAL. IOW, writing WAL record per page if multiple rows fit into a single data page as opposed to WAL record per row.

Test case 1: 100 million rows, 2 columns (int and float)

Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 121 | 77 | 36.3 | 1.57
CREATE MATERIALIZED VIEW | 101 | 49 | 51.4 | 2.06
REFRESH MATERIALIZED VIEW | 113 | 54 | 52.2 | 2.09
ALTER TABLE (TABLE REWRITE) | 124 | 81 | 34.6 | 1.53
COPY FROM | 71 | 72 | 0 | 1
INSERT INTO ... SELECT | 117 | 62 | 47 | 1.88
LOGICAL REPLICATION APPLY | 393 | 306 | 22.1 | 1.28

Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 6.8 | 2.4 | 64.7 | 2.83
CREATE MATERIALIZED VIEW | 7.2 | 2.3 | 68 | 3.13
REFRESH MATERIALIZED VIEW | 10 | 5.1 | 49 | 1.96
ALTER TABLE (TABLE REWRITE) | 8 | 3.2 | 60 | 2.5
COPY FROM | 2.9 | 3 | 0 | 1
INSERT INTO ... SELECT | 8 | 3 | 62.5 | 2.66
LOGICAL REPLICATION APPLY | 7.5 | 2.3 | 69.3 | 3.26

Test case 2: 1 billion rows, 1 column (int)

Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 794 | 386 | 51.38 | 2.05
CREATE MATERIALIZED VIEW | 1006 | 563 | 44.03 | 1.78
REFRESH MATERIALIZED VIEW | 977 | 603 | 38.28 | 1.62
ALTER TABLE (TABLE REWRITE) | 1189 | 714 | 39.94 | 1.66
COPY FROM | 321 | 330 | -0.02 | 0.97
INSERT INTO ... SELECT | 1084 | 586 | 45.94 | 1.84
LOGICAL REPLICATION APPLY | 3530 | 2982 | 15.52 | 1.18

Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 60 | 12 | 80 | 5
CREATE MATERIALIZED VIEW | 60 | 12 | 80 | 5
REFRESH MATERIALIZED VIEW | 60 | 12 | 80 | 5
ALTER TABLE (TABLE REWRITE) | 123 | 31 | 60 | 2.5
COPY FROM | 12 | 12 | 0 | 1
INSERT INTO ... SELECT | 120 | 24 | 80 | 5
LOGICAL REPLICATION APPLY | 61 | 12 | 80.32 | 5

looks pretty impressive!

Pavel

Test setup:
./configure --prefix=$PWD/pg17/ --enable-tap-tests CFLAGS="-ggdb3 -O2" > install.log && make -j 8 install > install.log 2>&1 &

wal_level=logical
max_wal_size = 256GB
checkpoint_timeout = 1h

Test system is EC2 instance of type c5.4xlarge:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 7
BogoMIPS: 5999.99
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
RAM:
MemTotal: 32036536 kB

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jeff Davis

Date:

25 April 2024, 16:41:08

On Wed, 2024-04-24 at 18:19 +0530, Bharath Rupireddy wrote:
> I added a flush callback named TableModifyBufferFlushCallback; when
> provided by callers invoked after tuples are flushed to disk from the
> buffers but before the AM frees them up. Index insertions and AFTER
> ROW INSERT triggers can be executed in this callback. See the v19-
> 0001 patch for how AM invokes the flush callback, and see either v19-
> 0003 or v19-0004 or v19-0005 for how a caller can supply the callback
> and required context to execute index insertions and AR triggers.

The flush callback takes a pointer to an array of slot pointers, and I
don't think that's the right API. I think the callback should be called
on each slot individually.

We shouldn't assume that a table AM stores buffered inserts as an array
of slot pointers. A TupleTableSlot has a fair amount of memory overhead
(64 bytes), so most AMs wouldn't want to pay that overhead for every
tuple. COPY does, but that's because the number of buffered tuples is
fairly small.

>
>
> > 11. Deprecate the multi_insert API.
>
> I did remove both table_multi_insert and table_finish_bulk_insert in
> v19-0006.

That's OK with me. Let's leave those functions out for now.

>
> If the design, code and benefits that these new Table AMs bring to
> the table look good, I hope to see it for PG 18.

Sounds good.

Regards,
    Jeff Davis

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Bharath Rupireddy

Date:

29 April 2024, 06:06:20

On Thu, Apr 25, 2024 at 10:11 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2024-04-24 at 18:19 +0530, Bharath Rupireddy wrote:
> > I added a flush callback named TableModifyBufferFlushCallback; when
> > provided by callers invoked after tuples are flushed to disk from the
> > buffers but before the AM frees them up. Index insertions and AFTER
> > ROW INSERT triggers can be executed in this callback. See the v19-
> > 0001 patch for how AM invokes the flush callback, and see either v19-
> > 0003 or v19-0004 or v19-0005 for how a caller can supply the callback
> > and required context to execute index insertions and AR triggers.
>
> The flush callback takes a pointer to an array of slot pointers, and I
> don't think that's the right API. I think the callback should be called
> on each slot individually.
>
> We shouldn't assume that a table AM stores buffered inserts as an array
> of slot pointers. A TupleTableSlot has a fair amount of memory overhead
> (64 bytes), so most AMs wouldn't want to pay that overhead for every
> tuple. COPY does, but that's because the number of buffered tuples is
> fairly small.

I get your point. An AM can choose to implement the buffering strategy
by just storing the plain tuple rather than the tuple slots in which
case the flush callback with an array of tuple slots won't work.
Therefore, I now changed the flush callback to accept only a single
tuple slot.

> > > 11. Deprecate the multi_insert API.
> >
> > I did remove both table_multi_insert and table_finish_bulk_insert in
> > v19-0006.
>
> That's OK with me. Let's leave those functions out for now.

Okay. Dropped the 0006 patch for now.

Please see the attached v20 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Bharath Rupireddy

Date:

15 May 2024, 07:26:17

On Mon, Apr 29, 2024 at 11:36 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please see the attached v20 patch set.

It looks like with the use of the new multi insert table access method
(TAM) for COPY (v20-0005), pgbench regressed about 35% [1]. The reason
is that the memory-based flushing decision the new TAM takes [2]
differs from that of what the COPY does today with table_multi_insert.
The COPY with table_multi_insert, maintains exact size of the tuples
in CopyFromState after it does the line parsing. For instance, the
tuple size of a table with two integer columns is 8 (4+4) bytes here.
The new TAM relies on the memory occupied by the slot's memory context
which holds the actual tuple as a good approximation for the tuple
size. But, this memory context size also includes a tuple header, so
the size here is not just 8 (4+4) bytes but more. Because of this, the
buffers get flushed sooner than that of the existing COPY with
table_multi_insert AM causing regression in pgbench which uses COPY
extensively. The new TAM aren't designed to be able to receive tuple
sizes from the callers, even if we do that, the API doesn't look
generic.

Here are couple of ideas to get away with this:

1. Try to get the actual tuple sizes excluding header sizes for each
column in the new TAM.
2. Try not to use the new TAM for COPY in which case the
table_multi_insert stays forever.
3. Try passing a flag to tell the new TAM that the caller does the
flushing and no need for an internal flushing.

I haven't explored the idea (1) in depth yet. If we find a way to do
so, it looks to me that we are going backwards since we need to strip
off headers for each column of a row for all of the rows. I suspect
this would cost a bit more and may not solve the regression.

With an eventual goal to get rid of table_multi_insert, (3) may not be
a better choice.

(3) seems reasonable to implement and reduce the regression. I did so
in the attached v21 patches. A new flag TM_SKIP_INTERNAL_BUFFER_FLUSH
is introduced in v21 patch, when specified, no internal flushing is
done, the caller has to flush the buffered tuples using
table_modify_buffer_flush(). Check the test results [3] HEAD 2.948 s,
PATCHED 2.946 s.

v21 also adds code to maintain tuple size for virtual tuple slots.
This helps make better memory-based flushing decisions in the new TAM.

Thoughts?

[1]
HEAD:
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.99 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.78 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.88 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.97 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.07 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.86 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.96 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.90 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.05 s, vacuum 0.21 s, primary keys 0.64 s).
done in 2.83 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.96 s, vacuum 0.21 s, primary keys 0.66 s).
done in 2.80 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.95 s, vacuum 0.20 s, primary keys 0.63 s).
done in 2.79 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 1.89 s, vacuum 0.21 s, primary keys 0.69 s).
done in 3.75 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.17 s, vacuum 0.32 s, primary keys 1.25 s).
done in 3.86 s (drop tables 0.00 s, create tables 0.08 s, client-side
generate 2.97 s, vacuum 0.21 s, primary keys 0.59 s).

AVG done in 2.948 s

v20 PATCHED:
done in 3.94 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.12 s, vacuum 0.19 s, primary keys 0.62 s).
done in 4.04 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.22 s, vacuum 0.20 s, primary keys 0.61 s).
done in 3.98 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.16 s, vacuum 0.20 s, primary keys 0.61 s).
done in 4.04 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.16 s, vacuum 0.20 s, primary keys 0.67 s).
done in 3.98 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.16 s, vacuum 0.20 s, primary keys 0.61 s).
done in 4.00 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.17 s, vacuum 0.20 s, primary keys 0.63 s).
done in 4.43 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.24 s, vacuum 0.21 s, primary keys 0.98 s).
done in 4.16 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 3.36 s, vacuum 0.20 s, primary keys 0.59 s).
done in 3.62 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.83 s, vacuum 0.20 s, primary keys 0.58 s).
done in 3.67 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.84 s, vacuum 0.21 s, primary keys 0.61 s).

AVG done in 3.986 s

[2]
+    /*
+     * Memory allocated for the whole tuple is in slot's memory context, so
+     * use it keep track of the total space occupied by all buffered tuples.
+     */
+    if (TTS_SHOULDFREE(slot))
+        mistate->cur_size += MemoryContextMemAllocated(slot->tts_mcxt, false);
+
+    if (mistate->cur_slots >= HEAP_MAX_BUFFERED_SLOTS ||
+        mistate->cur_size >= HEAP_MAX_BUFFERED_BYTES)
+        heap_modify_buffer_flush(state);

[3]
v21 PATCHED:
done in 2.92 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.12 s, vacuum 0.21 s, primary keys 0.59 s).
done in 2.89 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.07 s, vacuum 0.21 s, primary keys 0.61 s).
done in 2.89 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.05 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.90 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.07 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.80 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.00 s, vacuum 0.21 s, primary keys 0.59 s).
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.04 s, vacuum 0.20 s, primary keys 0.60 s).
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.03 s, vacuum 0.20 s, primary keys 0.59 s).
done in 2.85 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.04 s, vacuum 0.20 s, primary keys 0.60 s).
done in 3.48 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.44 s, vacuum 0.23 s, primary keys 0.80 s).
done in 3.05 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 2.28 s, vacuum 0.21 s, primary keys 0.55 s).

AVG done in 2.946 s

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Alvaro Herrera

Date:

15 May 2024, 09:14:14

Sorry to interject, but --

On 2024-May-15, Bharath Rupireddy wrote:

> It looks like with the use of the new multi insert table access method
> (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].

Where does this acronym "TAM" comes from for "table access method"?  I
find it thoroughly horrible and wish we didn't use it.  What's wrong
with using "table AM"?  It's not that much longer, much clearer and
reuses our well-established acronym AM.

We don't use IAM anywhere, for example (it's always "index AM"), and I
don't think we'd turn "sequence AM" into SAM either, would we?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Bharath Rupireddy

Date:

15 May 2024, 09:59:47

On Wed, May 15, 2024 at 2:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > It looks like with the use of the new multi insert table access method
> > (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].
>
> Where does this acronym "TAM" comes from for "table access method"?

Thanks for pointing it out. I used it for just the discussion sake in
this response. Although a few of the previous responses from others in
this thread mentioned that word, none of the patches have it added in
the code. I'll ensure to not use it further in this thread if that
worries one like another acronym is being added.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Andres Freund

Date:

15 May 2024, 22:03:19

Hi,

On 2024-05-15 11:14:14 +0200, Alvaro Herrera wrote:
> On 2024-May-15, Bharath Rupireddy wrote:
> 
> > It looks like with the use of the new multi insert table access method
> > (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].
> 
> Where does this acronym "TAM" comes from for "table access method"?  I
> find it thoroughly horrible and wish we didn't use it.  What's wrong
> with using "table AM"?  It's not that much longer, much clearer and
> reuses our well-established acronym AM.

Strongly agreed. I don't know why I dislike TAM so much though.

Greetings,

Andres Freund

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Michael Paquier

Date:

15 May 2024, 23:07:11

On Wed, May 15, 2024 at 11:14:14AM +0200, Alvaro Herrera wrote:
> We don't use IAM anywhere, for example (it's always "index AM"), and I
> don't think we'd turn "sequence AM" into SAM either, would we?

SAM is not a term I've seen used for sequence AMs in the past, I don't
intend to use it.  TAM is similar strange to me, but perhaps it's just
because I am used to table AMs as a whole.
--
Michael

Attachment

signature.asc

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jeff Davis

Date:

15 May 2024, 23:31:42

On Wed, 2024-05-15 at 12:56 +0530, Bharath Rupireddy wrote:
> Because of this, the
> buffers get flushed sooner than that of the existing COPY with
> table_multi_insert AM causing regression in pgbench which uses COPY
> extensively.

The flushing behavior is entirely controlled by the table AM. The heap
can use the same flushing logic that it did before, which is to hold
1000 tuples.

I like that it's accounting for memory, too, but it doesn't need to be
overly restrictive. Why not just use work_mem? That should hold 1000
reasonably-sized tuples, plus overhead.

Even better would be if we could take into account partitioning. That
might be out of scope for your current work, but it would be very
useful. We could have a couple new GUCs like modify_table_buffer and
modify_table_buffer_per_partition or something like that.

> 1. Try to get the actual tuple sizes excluding header sizes for each
> column in the new TAM.

I don't see the point in arbitrarily excluding the header.

> v21 also adds code to maintain tuple size for virtual tuple slots.
> This helps make better memory-based flushing decisions in the new
> TAM.

That seems wrong. We shouldn't need to change the TupleTableSlot
structure for this patch.

Comments on v21:

* All callers specify TM_FLAG_MULTI_INSERTS. What's the purpose?

* The only caller that doesn't use TM_FLAG_BAS_BULKWRITE is
ExecInsert(). What's the disadvantage to using a bulk insert state
there?

* I'm a bit confused by TableModifyState->modify_end_callback. The AM
both sets the callback and calls the callback -- why can't the code
just go into the table_modify_end method?

* The code structure in table_modify_begin() (and related) is strange.
Can it be simplified or am I missing something?

* Why are table_modify_state and insert_modify_buffer_flush_context
globals? What if there are multiple modify nodes in a plan?

* Can you explain the design in logical rep?

Regards,
    Jeff Davis

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jeff Davis

Date:

16 May 2024, 19:00:36

On Wed, 2024-05-15 at 16:31 -0700, Jeff Davis wrote:
> Even better would be if we could take into account partitioning. That
> might be out of scope for your current work, but it would be very
> useful. We could have a couple new GUCs like modify_table_buffer and
> modify_table_buffer_per_partition or something like that.

To expand on this point:

For heap, the insert buffer is only 1000 tuples, which doesn't take
much memory. But for an AM that does any significant reorganization of
the input data, the buffer may be much larger. For insert into a
partitioned table, that buffer could be multiplied across many
partitions, and start to be a real concern.

We might not need table AM API changes at all here beyond what v21
offers. The ModifyTableState includes the memory context, so that gives
the caller a way to know the memory consumption of a single partition's
buffer. And if it needs to free the resources, it can just call
modify_table_end(), and then _begin() again if more tuples hit that
partition.

So I believe what I'm asking for here is entirely orthogonal to the
current proposal.

However, it got me thinking that we might not want to use work_mem for
controlling the heap's buffer size. Each AM is going to have radically
different memory needs, and may have its own (extension) GUCs to
control that memory usage, so they won't honor work_mem. We could
either have a separate GUC for the heap if it makes sense, or we could
just hard-code a reasonable value.

Regards,
    Jeff Davis

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Bharath Rupireddy

Date:

05 June 2024, 07:12:17

Hi,

On Thu, May 16, 2024 at 5:01 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> The flushing behavior is entirely controlled by the table AM. The heap
> can use the same flushing logic that it did before, which is to hold
> 1000 tuples.
>
> I like that it's accounting for memory, too, but it doesn't need to be
> overly restrictive. Why not just use work_mem? That should hold 1000
> reasonably-sized tuples, plus overhead.
>
> Even better would be if we could take into account partitioning. That
> might be out of scope for your current work, but it would be very
> useful. We could have a couple new GUCs like modify_table_buffer and
> modify_table_buffer_per_partition or something like that.

I disagree with inventing more GUCs. Instead, I'd vote for just
holding 1000 tuples in buffers for heap AM. This not only keeps the
code and new table AM simple, but also does not cause regression for
COPY. In my testing, 1000 tuples with 1 int and 1 float columns took
40000 bytes of memory (40 bytes each tuple), whereas with 1 int, 1
float and 1 text columns took 172000 bytes of memory (172 bytes each
tuple) bytes which IMO mustn't be a big problem. Thoughts?

> > 1. Try to get the actual tuple sizes excluding header sizes for each
> > column in the new TAM.
>
> I don't see the point in arbitrarily excluding the header.
>
> > v21 also adds code to maintain tuple size for virtual tuple slots.
> > This helps make better memory-based flushing decisions in the new
> > TAM.
>
> That seems wrong. We shouldn't need to change the TupleTableSlot
> structure for this patch.

I dropped these ideas as I went ahead with the above idea of just
holding 1000 tuples in buffers for heap AM.

> Comments on v21:
>
> * All callers specify TM_FLAG_MULTI_INSERTS. What's the purpose?

Previously, the multi insert state was initialized in modify_begin, so
it was then required to differentiate the code path. But, it's not
needed anymore with the lazy initialization of the multi insert state
moved to modify_buffer_insert. I removed it.

> * The only caller that doesn't use TM_FLAG_BAS_BULKWRITE is
> ExecInsert(). What's the disadvantage to using a bulk insert state
> there?

The subsequent read queries will not find the just-now-inserted tuples
in shared buffers as a separate ring buffer is used with bulk insert
access strategy. The multi inserts is nothing but buffering multiple
tuples plus inserting in bulk. So using the bulk insert strategy might
be worth it for INSERT INTO SELECTs too. Thoughts?

> * I'm a bit confused by TableModifyState->modify_end_callback. The AM
> both sets the callback and calls the callback -- why can't the code
> just go into the table_modify_end method?

I came up with modify_end_callback as per the discussion upthread to
use modify_begin, modify_end in future for UPDATE, DELETE and MERGE,
and not use any operation specific flags to clean the state
appropriately. The operation specific state cleaning logic can go to
the modify_end_callback implementation defined by the AM.

> * The code structure in table_modify_begin() (and related) is strange.
> Can it be simplified or am I missing something?

I previously defined these new table AMs as optional, check
GetTableAmRoutine(). And, there was a point upthread to provide
default/fallback implementation to help not fail insert operations on
tables without the new table AMs implemented. FWIW, the default
implementation was just doing the single inserts. The
table_modify_begin and friends need the logic to fallback making the
code there look different than other AMs. However, I now have a
feeling to drop the idea of having fallback implementation and let the
AMs deal with it. Although it might create some friction with various
non-core AM implementations, it keeps this patch simple which I would
vote for. Thoughts?

> * Why are table_modify_state and insert_modify_buffer_flush_context
> globals? What if there are multiple modify nodes in a plan?

Can you please provide the case that can generate multiple "modify
nodes" in a single plan? AFAICS, multiple "modify nodes" in a plan can
exist for both partitioned tables and tables that get created as part
of CTEs. I disabled multi inserts for both of these cases. The way I
disabled for CTEs looks pretty naive - I just did the following. Any
better suggestions here to deal with all such cases?

+        if (operation == CMD_INSERT &&
+            nodeTag(subplanstate) == T_SeqScanState)
+            canMultiInsert = true;

> * Can you explain the design in logical rep?

Multi inserts for logical replication work at the table level. In
other words, all tuple inserts related to a single table within a
transaction are buffered and written to the corresponding table when
necessary. Whenever inserts pertaining to another table arrive, the
buffered tuples related to the previous table are written to the table
before starting the buffering for the new table. Also, the tuples are
written to the table from the buffer when there arrives a non-INSERT
operation, for example, UPDATE/DELETE/TRUNCATE/COMMIT etc. FWIW,
pglogical has the similar multi inserts logic -
https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L879.

Please find the v22 patches with the above changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jingtang Zhang

Date:

28 October 2024, 11:43:09

Hi! Glad to see update in this thread.

Little question about v24 0002 patch: would it be better to move the
implementation of TableModifyIsMultiInsertsSupported to somewhere for table AM
level? Seems it is a common function for future use, not a specific one for

matview.

---

Regards, Jingtang

Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> 于2024年10月26日周六 21:31写道：

Hi,

Thanks for looking into this.

On Thu, Aug 29, 2024 at 12:29 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> I believe we need the branching in the caller anyway:
>
> 1. If there is a BEFORE row trigger with a volatile function, the
> visibility rules[1] mean that the function should see changes from all
> the rows inserted so far this command, which won't work if they are
> still in the buffer.
>
> 2. Similarly, for an INSTEAD OF row trigger, the visibility rules say
> that the function should see all previous rows inserted.
>
> 3. If there are volatile functions in the target list or WHERE clause,
> the same visibility semantics apply.
>
> 4. If there's a "RETURNING ctid" clause, we need to either come up with
> a way to return the tuples after flushing, or we need to use the
> single-tuple path. (Similarly in the future when we support UPDATE ...
> RETURNING, as Matthias pointed out.)
>
> If we need two paths in each caller anyway, it seems cleaner to just
> wrap the check for tuple_modify_buffer_insert in
> table_modify_buffer_enabled().
>
> We could perhaps use a one path and then force a batch size of one or
> something, which is an alternative, but we have to be careful not to
> introduce a regression (and it still requires a solution for #4).

I chose to branch in the caller e.g. if there's a volatile function
SELECT query of REFRESH MATERIALIZED VIEW, the caller goes
table_tuple_insert() path, else multi-insert path.

I am posting the new v24 patch set organized as follows: 0001
introducing the new table AM, 0002 optimizing CTAS, CMV and RMV, 0003
using the new table AM for COPY ... FROM. I, for now, discarded the
INSERT INTO ... SELECT and Logical Replication Apply patches, the idea
is to take the basic stuff forward.

I reworked structure names, members and function names, reworded
comments, addressed review comments in the v24 patches. Please have a
look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jingtang Zhang

Date:

28 October 2024, 17:48:34

Oh, another comments for v24-0001 patch: we are in heam AM now, should we use
something like HEAP_INSERT_BAS_BULKWRITE instead of using table AM option,
just like other heap AM options do?

> +        if ((state->options & TABLE_INSERT_BAS_BULKWRITE) != 0)
> +            istate->bistate = GetBulkInsertState();


—

Regards, Jingtang

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jingtang Zhang

Date:

29 October 2024, 06:18:42

Hi~

Sorry for multiple comments in separate mail. Just found that the initialization
seems redundant since we have used palloc0?

> +        istate = (HeapInsertState *) palloc0(sizeof(HeapInsertState));
> +        istate->bistate = NULL;
> +        istate->mistate = NULL;

---
Regards, Jingtang

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jingtang Zhang

Date:

31 October 2024, 07:17:14

Hi~

I did some performance test these days, and I have some findings.

From the archive months ago, I found there were discussions about which type
of TupleTableSlot to use for buffering tuples. A single column mat view was
used for evaluation. Finally we used virtual one.

However when I test with a 32-columns mat view, I get regression.

Test case:

-- prepare
create table test as
  select
    i as id0,
    i + 1 as id1,
    i + 2 as id2,
    i + 3 as id3,
    i + 4 as id4,
    i + 5 as id5,
    i + 6 as id6,
    i + 7 as id7,
    i + 8 as id8,
    i + 9 as id9,
    i + 10 as id10,
    i + 11 as id11,
    i + 12 as id12,
    i + 13 as id13,
    i + 14 as id14,
    i + 15 as id15,
    i + 0.01 as f0,
    i + 0.1 as f1,
    i + 0.2 as f2,
    i + 0.3 as f3,
    i + 0.4 as f4,
    i + 0.5 as f5,
    i + 0.6 as f6,
    i + 0.7 as f7,
    i + 0.8 as f8,
    i + 0.9 as f9,
    i + 1.01 as f10,
    i + 1.1 as f11,
    i + 1.2 as f12,
    i + 1.3 as f13,
    i + 1.4 as f14,
    i + 1.5 as f15,
    i + 1.6 as f16
  from generate_series(1,5000000) i;

-- run
create materialized view m1 as select * from test;


HEAD:
Time: 13615.542 ms (00:13.616)
Time: 13545.706 ms (00:13.546)
Time: 13578.475 ms (00:13.578)

Patched
Time: 20112.734 ms (00:20.113)
Time: 19996.957 ms (00:19.997)
Time: 19936.871 ms (00:19.937)


I did a quick perf, the overhead seems to come from virtual tuple materialization.

HEAD:
  12.29%  postgres            [.] pg_checksum_block
   6.33%  postgres            [.] GetPrivateRefCountEntry
   5.40%  postgres            [.] pg_comp_crc32c_sse42
   4.54%  [kernel]            [k] copy_user_enhanced_fast_string
   2.69%  postgres            [.] BufferIsValid
   1.52%  postgres            [.] XLogRecordAssemble

Patched:
  11.75%  postgres            [.] tts_virtual_materialize
   8.87%  postgres            [.] pg_checksum_block
   8.17%  postgres            [.] slot_deform_heap_tuple
   8.09%  postgres            [.] heap_compute_data_size
   6.17%  postgres            [.] fill_val
   3.81%  postgres            [.] heap_fill_tuple
   3.37%  postgres            [.] tts_virtual_copyslot
   2.62%  [kernel]            [k] copy_user_enhanced_fast_string

Not sure if it is a universal situation.


—
Regards, Jingtang

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Daniil Davydov

Date:

09 March, 14:27:58

Hi,
A few days ago I came up with an idea to implement multi insert
optimization wherever possible. I prepared a raw patch
and it showed a great performance gain (up to 4 times during INSERT
... INTO ... in the best case).
Then I was very happy to find this thread. You did a great job and I
want to help you to bring the matter to an end.

On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang <mrdrivingduck@gmail.com> wrote:
> I did some performance test these days, and I have some findings.
> HEAD:
>   12.29%  postgres            [.] pg_checksum_block
>    6.33%  postgres            [.] GetPrivateRefCountEntry
>    5.40%  postgres            [.] pg_comp_crc32c_sse42
>    4.54%  [kernel]            [k] copy_user_enhanced_fast_string
>    2.69%  postgres            [.] BufferIsValid
>    1.52%  postgres            [.] XLogRecordAssemble
>
> Patched:
>   11.75%  postgres            [.] tts_virtual_materialize
>    8.87%  postgres            [.] pg_checksum_block
>    8.17%  postgres            [.] slot_deform_heap_tuple
>    8.09%  postgres            [.] heap_compute_data_size
>    6.17%  postgres            [.] fill_val
>    3.81%  postgres            [.] heap_fill_tuple
>    3.37%  postgres            [.] tts_virtual_copyslot
>    2.62%  [kernel]            [k] copy_user_enhanced_fast_string

I applied v25 patches on the master branch and made some measurements
to find out what is the bottleneck in this case. The 'time' utility
showed that without a patch, this query will run 1.5 times slower. I
also made a few flamegraphs for this test. Most of the time is spent
calling
these two functions : tts_virtual_copyslot and heap_form_tuple.
All tests were run in virtual machine with these CPU characteristics:
Architecture:             x86_64
CPU(s):                   2
  On-line CPU(s) list:    0,1
Virtualization features:
  Virtualization:         AMD-V
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    128 KiB (2 instances)
  L1i:                    128 KiB (2 instances)
  L2:                     1 MiB (2 instances)
  L3:                     32 MiB (2 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1

In my implementation, I used Tuplestore functionality to store tuples.
In order to get rid of getting stuck in the above mentioned functions,
I crossed it with the current implementation (v25 patches) and got a
10% increase in performance (for the test above). I also set up v22
patches to
compare performance (with/without tuplestore) for INSERT ... INTO ...
queries (with -j 4 -c 10 parameters for pgbech), and there also was an
increase in TPS (about 3-4%).

I attach a patch that adds Tuplestore to v25. What do you think about this idea?

--
Best regards,
Daniil Davydov

Attachment

0001-Replace-holding-tuples-in-virtual-slots-with-tuplest.patch

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Daniil Davydov

Date:

17 March, 07:49:45

Hi,
Recently I took more careful measurements of the performance. I
compared three branches with each other: HEAD, Patched and Patched
with tuplestore.
Here are the results :

1)
Test case : matview creation test attached in the email from Jingtang Zhang.
10 measurements for each branch.
Result in wall clock execution time :

HEAD
30.532 +- 0.59 seconds elapsed
Patched
20.454 +- 0.114 seconds elapsed
Patched with tuplestore
19.653 +- 0.111 seconds elapsed

2)
-- init.sql
drop table test_insert;
vacuum;
checkpoint;
create table test_insert(i int, f float);

-- iowrite.sql
insert into test_insert select g, (g % 100) / 100.0 from
generate_series(1, 1000000) as g;

Test case :
pgbench -f iowrite.sql -n -j 4 -c 10 -T 40
5 measurements for each branch.
Result in tps :

HEAD
1.025 +- 0.009
Patched
2.923 +- 0.032
Patched with tuplestore
2.987 +- 0.011

P.S.
I cannot find a commitfest entry for this patch. Should we add it there?

--
Best regards,
Daniil Davydov

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Jingtang Zhang

Date:

06 April, 16:54:44

Hi, Daniil Davydov.

Thanks for sharing your patch!

It was quite a while since I last looked at the patch. I've tested it again,
and still get regression on patched version where a table has many columns.
And it is totally CPU-bounded on tts_virtual_copyslot.

Unpatched version:
1 col:
Time: 8909.714 ms (00:08.910)
Time: 8803.579 ms (00:08.804)
Time: 8600.415 ms (00:08.600)
32 cols:
Time: 12911.699 ms (00:12.912)
Time: 13543.491 ms (00:13.543)
Time: 13325.368 ms (00:13.325)

Patched version:
1 col:
Time: 3532.841 ms (00:03.533)
Time: 3598.223 ms (00:03.598)
Time: 3515.858 ms (00:03.516)
32 cols:
Time: 35647.724 ms (00:35.648)
Time: 35596.233 ms (00:35.596)
Time: 35669.106 ms (00:35.669)

I've tested your patch with tuplestore and found the regression does not exist
anymore, but I haven't look deep inside it.

Patched version (with tuplestore):
1 col:
Time: 3500.502 ms (00:03.501)
Time: 3486.886 ms (00:03.487)
Time: 3514.233 ms (00:03.514)
32 cols:
Time: 10375.391 ms (00:10.375)
Time: 10248.256 ms (00:10.248)
Time: 10248.289 ms (00:10.248)

It seems to be a good idea if there is no other issue with your patch.

--
Regards,
Jingtang

Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM

From

Daniil Davydov

Date:

07 April, 09:26:25

Hi,

On Sun, Apr 6, 2025 at 8:55 PM Jingtang Zhang <mrdrivingduck@gmail.com> wrote:
>
> It was quite a while since I last looked at the patch. I've tested it again,
> and still get regression on patched version where a table has many columns.
> And it is totally CPU-bounded on tts_virtual_copyslot.
>
> Unpatched version:
> 1 col:
> Time: 8909.714 ms (00:08.910)
> Time: 8803.579 ms (00:08.804)
> Time: 8600.415 ms (00:08.600)
> 32 cols:
> Time: 12911.699 ms (00:12.912)
> Time: 13543.491 ms (00:13.543)
> Time: 13325.368 ms (00:13.325)
>
> Patched version:
> 1 col:
> Time: 3532.841 ms (00:03.533)
> Time: 3598.223 ms (00:03.598)
> Time: 3515.858 ms (00:03.516)
> 32 cols:
> Time: 35647.724 ms (00:35.648)
> Time: 35596.233 ms (00:35.596)
> Time: 35669.106 ms (00:35.669)
>

Hm, maybe I didn't choose the best way to measure performance. Can you
please share how you do it?

> I've tested your patch with tuplestore and found the regression does not exist
> anymore, but I haven't look deep inside it.
>
> Patched version (with tuplestore):
> 1 col:
> Time: 3500.502 ms (00:03.501)
> Time: 3486.886 ms (00:03.487)
> Time: 3514.233 ms (00:03.514)
> 32 cols:
> Time: 10375.391 ms (00:10.375)
> Time: 10248.256 ms (00:10.248)
> Time: 10248.289 ms (00:10.248)
>
> It seems to be a good idea if there is no other issue with your patch.

As far as I understand, the use of multi inserts for queries like
"INSERT INTO ... SELECT FROM" is not discussed here anymore due to the
fact that in such cases we will have to take into account the volatile
functions and ROW triggers.
I've been thinking about this for a while and made a patch as an
experiment. The principles that the patch works on are listed below.
1)
Since performance decreases for single INSERTs (within a multi inserts
mechanism), I designed this feature as an option for the table. Thus,
if the user knows that he will perform a lot of inserts on the table,
he can specify "WITH (append_optimized=true)".
2)
The availability of volatile functions is monitored during the
construction of a subtree for a ModifyTable node. I'm not that
familiar with the query plan construction mechanism, but it seems to
me that this way we can track any occurrence of volatile functions.
Of course, most volatile functions don't stop us from using multi
inserts, but checking each such function would take a very long time,
so the very fact of having a volatile function is enough for us to
abandon multi-inserts.
3)
Default expressions of the target table are also checked for volatile
functions. The same rules apply to them as in (2). As an exception, I
allowed the use of SERIAL in the column data type, since this is a
fairly common use case.
4)
If the target table contains any ROW triggers, we don't use multi insert.
5)
Patch also contains a regression test. This is a "sandbox" where you
can do some experiments with append-optimized tables.

I hope that patch (targeted on 'master' branch,
2c7bd2ba507e273f2d7fe1b2f6d30775ed4f3c09) will be useful for this
thread.

--
Best regards,
Daniil Davydov

Attachment

v1-0001-Meet-append-optimized-tables.patch