Thread: New Table Access Methods for Multi and Single Inserts
Hi,

Currently, for any component (such as COPY, CTAS [1], CREATE/REFRESH Mat View [1], INSERT INTO SELECTs [2]), multi insert logic such as buffer slot allocation, maintenance, the decision to flush, and clean up needs to be implemented outside the table_multi_insert() API. The main problem is that this fails to take the underlying storage engine's capabilities into consideration; for more details on this point, refer to the discussion in the multi inserts in CTAS thread [1]. It also creates a lot of duplicate code, which is error prone and hard to maintain.

More importantly, in another thread [3] Andres Freund suggested having table insert APIs that look more like the 'scan' APIs, i.e. insert_begin, insert, insert_end. The main advantage of doing this is (quoting his statement in [3]): "more importantly it'd allow an AM to optimize operations across multiple inserts, which is important for column stores."

I propose to introduce new table access methods for both multi and single inserts based on the prototype suggested by Andres in [3]. The main design goal of these new APIs is to give tableam developers the flexibility to implement multi insert logic dependent on the underlying storage engine.

Below are the APIs. I suggest having a look at v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch for details of the new data structure and the API functionality. Note that I temporarily used the XX_v2 naming; we can change it later.

TableInsertState* table_insert_begin(initial_args);
void table_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot);
void table_multi_insert_flush(TableInsertState *state);
void table_insert_end(TableInsertState *state);

I'm attaching a few patches (just to show that these APIs work, avoid a lot of duplicate code, and make life easier). Better commenting can be added later. If these APIs and patches look okay, we can even consider using them to replace the existing insert calls in other places such as nodeModifyTable.c and so on.

v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch ---> introduces new table access methods for multi and single inserts. Also implements/rearranges the outside code for the heap AM into these new APIs.
v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch ---> adds the new multi insert table access methods to CREATE TABLE AS, CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.
v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch ---> adds the new single insert table access method to the ALTER TABLE rewrite table code.
v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch ---> adds the new single and multi insert table access methods to the COPY code.

Thoughts?

Many thanks to Robert, Vignesh and Dilip for the off-list discussion.

[1] - https://www.postgresql.org/message-id/4eee0730-f6ec-e72d-3477-561643f4b327%40swarm64.com
[2] - https://www.postgresql.org/message-id/20201124020020.GK24052%40telsasoft.com
[3] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
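[Illustration: a caller such as COPY or CTAS would be expected to drive the proposed APIs roughly as follows. This is only a sketch: the exact begin() arguments are the "initial_args" placeholder above, and get_next_source_slot() is an invented stand-in for wherever the caller obtains its tuples.]

    TableInsertState *istate;
    TupleTableSlot   *slot;

    /* begin: the AM sets up buffered slots, bulk-insert state, etc. */
    istate = table_insert_begin(rel, mycid, ti_options, /* is_multi */ true);

    /* feed tuples; the AM decides internally when a batch is flushed */
    while ((slot = get_next_source_slot()) != NULL)
        table_multi_insert_v2(istate, slot);

    /* flush whatever is still buffered, then release all insert state */
    table_multi_insert_flush(istate);
    table_insert_end(istate);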
Attachment
On Tue, Dec 8, 2020 at 6:27 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> Hi,
>
> Currently, for any component (such as COPY, CTAS[1], CREATE/REFRESH
> Mat View[1], INSERT INTO SELECTs[2]) multi insert logic such as buffer
> slots allocation, maintenance, decision to flush and clean up, need to
> be implemented outside the table_multi_insert() API. The main problem
> is that it fails to take into consideration the underlying storage
> engine capabilities, for more details of this point refer to a
> discussion in multi inserts in CTAS thread[1]. This also creates a lot
> of duplicate code which is more error prone and not maintainable.
>
> More importantly, in another thread [3] @Andres Freund suggested to
> have table insert APIs in such a way that they look more like 'scan'
> APIs i.e. insert_begin, insert, insert_end. The main advantages doing
> this are(quoting from his statement in [3]) - "more importantly it'd
> allow an AM to optimize operations across multiple inserts, which is
> important for column stores."
>
> I propose to introduce new table access methods for both multi and
> single inserts based on the prototype suggested by Andres in [3]. Main
> design goal of these new APIs is to give flexibility to tableam
> developers in implementing multi insert logic dependent on the
> underlying storage engine.
>
> Below are the APIs. I suggest to have a look at
> v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch for details
> of the new data structure and the API functionality. Note that
> temporarily I used XX_v2, we can change it later.
>
> TableInsertState* table_insert_begin(initial_args);
> void table_insert_v2(TableInsertState *state, TupleTableSlot *slot);
> void table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot);
> void table_multi_insert_flush(TableInsertState *state);
> void table_insert_end(TableInsertState *state);
>
> I'm attaching a few patches(just to show that these APIs work, avoids
> a lot of duplicate code and makes life easier). Better commenting can
> be added later. If these APIs and patches look okay, we can even
> consider replacing them in other places such as nodeModifyTable.c and
> so on.
>
> v1-0001-New-Table-AMs-for-Multi-and-Single-Inserts.patch --->
> introduces new table access methods for multi and single inserts. Also
> implements/rearranges the outside code for heap am into these new
> APIs.
> v1-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-Table-AM.patch
> ---> adds new multi insert table access methods to CREATE TABLE AS,
> CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.
> v1-0003-ATRewriteTable-With-New-Single-Insert-Table-AM.patch ---> adds
> new single insert table access method to ALTER TABLE rewrite table
> code.
> v1-0004-COPY-With-New-Multi-and-Single-Insert-Table-AM.patch ---> adds
> new single and multi insert table access method to COPY code.
>
> Thoughts?
>
> [1] - https://www.postgresql.org/message-id/4eee0730-f6ec-e72d-3477-561643f4b327%40swarm64.com
> [2] - https://www.postgresql.org/message-id/20201124020020.GK24052%40telsasoft.com
> [3] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de
Added this to commitfest to get it reviewed further.
https://commitfest.postgresql.org/31/2871/
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Typos:

+ * 1) Specify is_multi as true, then multi insert state is allcoated.
=> allocated

+ * dropped, short-lived memory context is delted and mistate is freed up.
=> deleted

+ * 2) Currently, GetTupleSize() handles the existing heap, buffer, minmal and
=> minimal

+ /* Mulit insert state if requested, otherwise NULL. */
=> multi

+ * Buffer the input slots and insert the tuples from the buffered slots at a
=> *one* at a time ?

+ * Compute the size of the tuple only if mi_max_size i.e. the total tuple size
=> I guess you mean max_size

This variable could use a better name:

+CopyMulitInsertFlushBuffers(List **mirri, ..

mirri is fine for a local variable like an element of a structure/array, or a loop variable, but not for a function parameter which is a "List" of arbitrary pointers.

I think this comment needs to be updated (again) for the removal of the Info structure.

- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
+ * multi insert buffer items stored in CopyMultiInsertInfo's

I think the COPY patch should be 0002 (or maybe merged into 0001). There's some superfluous whitespace (and other) changes there which make the patch unnecessarily long.

You made the v2 insert interface a requirement for all table AMs. Should it be optional, and fall back to simple inserts if not implemented?

For CTAS, I think we need to consider Paul's idea here.
https://www.postgresql.org/message-id/26C14A63-CCE5-4B46-975A-57C1784B3690%40vmware.com
Conceivably, tableam should support something like that for arbitrary AMs ("insert into a new table for which we have exclusive lock"). I think that AM method should also be optional. It should be possible to implement a minimal AM without implementing every available optimization, which may not apply to all AMs, anyway.

--
Justin
Thanks a lot for taking a look at the patches.

On Thu, Dec 17, 2020 at 10:35 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Typos:
>
> + * 1) Specify is_multi as true, then multi insert state is allcoated.
> => allocated
> + * dropped, short-lived memory context is delted and mistate is freed up.
> => deleted
> + * 2) Currently, GetTupleSize() handles the existing heap, buffer, minmal and
> => minimal
> + /* Mulit insert state if requested, otherwise NULL. */
> => multi
> + * Buffer the input slots and insert the tuples from the buffered slots at a
> => *one* at a time ?
> + * Compute the size of the tuple only if mi_max_size i.e. the total tuple size
> => I guess you mean max_size
>
> This variable could use a better name:
> +CopyMulitInsertFlushBuffers(List **mirri, ..
> mirri is fine for a local variable like an element of a structure/array, or a
> loop variable, but not for a function parameter which is a "List" of arbitrary
> pointers.
>
> I think this comment needs to be updated (again) for the removal of the Info
> structure.
> - * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
> + * multi insert buffer items stored in CopyMultiInsertInfo's
>
> There's some superfluous whitespace (and other) changes there which make the
> patch unnecessarily long.

I will correct them and post the next version of the patch set. Before that, I would like to have the discussion and thoughts on the APIs and their usefulness.

> I think the COPY patch should be 0002 (or maybe merged into 0001).

I can make it the 0002 patch.

> You made the v2 insert interface a requirement for all table AMs.
> Should it be optional, and fall back to simple inserts if not implemented ?

I tried to implement the APIs mentioned by Andres here in [1]. I just used the v2 table AM APIs in the existing table_insert places to show that it works. Having said that, if you notice, I moved the bulk insert allocation and deallocation to the new APIs table_insert_begin() and table_insert_end() respectively, which makes them tableam specific. Currently, the bulk insert state is outside and independent of tableam. I think we should not make bulk insert state allocation and deallocation tableam specific. Thoughts?

[1] - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de

> For CTAS, I think we need to consider Paul's idea here.
> https://www.postgresql.org/message-id/26C14A63-CCE5-4B46-975A-57C1784B3690%40vmware.com

IMO, if we were to allow those raw insert APIs to perform parallel inserts, then we would be reimplementing the existing table_insert or table_multi_insert API, by having some sort of shared memory for coordinating among workers and so on, maybe in some other way. Yes, we could avoid all the existing locking and shared buffers with those raw insert APIs, but I also feel that we can now do that with the existing insert APIs for unlogged tables and a bulk insert state. To me, the raw insert APIs, after implementing them for parallel inserts, would look like the existing insert APIs for unlogged tables with a bulk insert state. Thoughts? Please have a look at [2] for a detailed comment.

[2] - https://www.postgresql.org/message-id/CALj2ACX0u%3DQvB7GHLEqeVYwvs2eQS7%3D-cEuem7ZaF%3Dp%2BqZ0ikA%40mail.gmail.com

> Conceivably, tableam should support something like that for arbitrary AMs
> ("insert into a new table for which we have exclusive lock"). I think that AM
> method should also be optional. It should be possible to implement a minimal
> AM without implementing every available optimization, which may not apply to
> all AMs, anyway.

I could not understand this point well. Maybe more thoughts help me here.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 17, 2020 at 04:35:33PM +0530, Bharath Rupireddy wrote:
> > You made the v2 insert interface a requirement for all table AMs.
> > Should it be optional, and fall back to simple inserts if not implemented ?
>
> I tried to implement the APIs mentioned by Andres here in [1]. I just
> used the v2 table AM APIs in the existing table_insert places to show that it
> works. Having said that, if you notice, I moved the bulk insert
> allocation and deallocation to the new APIs table_insert_begin() and
> table_insert_end() respectively, which makes them tableam specific.

I mean I think it should be optional for a tableam to support the optimized insert routines. Here, you've made it a requirement.

+   Assert(routine->tuple_insert_begin != NULL);
+   Assert(routine->tuple_insert_v2 != NULL);
+   Assert(routine->multi_insert_v2 != NULL);
+   Assert(routine->multi_insert_flush != NULL);
+   Assert(routine->tuple_insert_end != NULL);

+static inline void
+table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
+{
+   state->rel->rd_tableam->multi_insert_v2(state, slot);
+}

If multi_insert_v2 == NULL, I think table_multi_insert_v2() would just call table_insert_v2(), and begin/flush/end would do nothing. If multi_insert_v2 != NULL, then you should assert that the other routines are provided.

Are you thinking that TableInsertState would eventually have additional attributes which would apply to other tableams, but not to heap? Is heap_insert_begin() really specific to heap? It's allocating and populating a structure based on its arguments, but those same arguments would be passed to every other AM's insert_begin routine, too. Do you need a more flexible data structure, something that would also accommodate extensions? I'm thinking of reloptions as a loose analogy.

--
Justin
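[Illustration: a rough sketch of the fallback Justin describes, using the callback names from the v1 patch. This is not code from the patch, only one way the optional behaviour could look.]

    /*
     * Sketch: if the AM does not provide multi_insert_v2, fall back to its
     * plain single-insert callback; flush becomes a no-op for such AMs.
     */
    static inline void
    table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
    {
        if (state->rel->rd_tableam->multi_insert_v2 != NULL)
            state->rel->rd_tableam->multi_insert_v2(state, slot);
        else
            state->rel->rd_tableam->tuple_insert_v2(state, slot);
    }

    static inline void
    table_multi_insert_flush(TableInsertState *state)
    {
        /* Nothing to flush if the AM never buffers tuples. */
        if (state->rel->rd_tableam->multi_insert_flush != NULL)
            state->rel->rd_tableam->multi_insert_flush(state);
    }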
On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Thu, Dec 17, 2020 at 04:35:33PM +0530, Bharath Rupireddy wrote: > > > You made the v2 insert interface a requirement for all table AMs. > > > Should it be optional, and fall back to simple inserts if not implemented ? > > > > I tried to implement the APIs mentioned by Andreas here in [1]. I just > > used v2 table am APIs in existing table_insert places to show that it > > works. Having said that, if you notice, I moved the bulk insert > > allocation and deallocation to the new APIs table_insert_begin() and > > table_insert_end() respectively, which make them tableam specific. > > I mean I think it should be optional for a tableam to support the optimized > insert routines. Here, you've made it a requirement. > > + Assert(routine->tuple_insert_begin != NULL); > + Assert(routine->tuple_insert_v2 != NULL); > + Assert(routine->multi_insert_v2 != NULL); > + Assert(routine->multi_insert_flush != NULL); > + Assert(routine->tuple_insert_end != NULL); +1 to make them optional. I will change. > +static inline void > +table_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot) > +{ > + state->rel->rd_tableam->multi_insert_v2(state, slot); > +} > > If multi_insert_v2 == NULL, I think table_multi_insert_v2() would just call > table_insert_v2(), and begin/flush/end would do nothing. If > table_multi_insert_v2!=NULL, then you should assert that the other routines are > provided. What should happen if both multi_insert_v2 and insert_v2 are NULL? Should we error out from table_insert_v2()? > Are you thinking that TableInsertState would eventually have additional > attributes which would apply to other tableams, but not to heap ? Is > heap_insert_begin() really specific to heap ? It's allocating and populating a > structure based on its arguments, but those same arguments would be passed to > every other AM's insert_begin routine, too. Do you need a more flexible data > structure, something that would also accomodate extensions? I'm thinking of > reloptions as a loose analogy. I could not think of other tableam attributes now. But +1 to have that kind of flexible structure for TableInsertState. So, it can have tableam type and attributes within the union for each type. > I moved the bulk insert allocation and deallocation to the new APIs table_insert_begin() > and table_insert_end() respectively, which make them tableam specific. > Currently, the bulk insert state is outside and independent of > tableam. I think we should not make bulk insert state allocation and > deallocation tableam specific. Any thoughts on the above point? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote: > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > Are you thinking that TableInsertState would eventually have additional > > attributes which would apply to other tableams, but not to heap ? Is > > heap_insert_begin() really specific to heap ? It's allocating and populating a > > structure based on its arguments, but those same arguments would be passed to > > every other AM's insert_begin routine, too. Do you need a more flexible data > > structure, something that would also accomodate extensions? I'm thinking of > > reloptions as a loose analogy. > > I could not think of other tableam attributes now. But +1 to have that > kind of flexible structure for TableInsertState. So, it can have > tableam type and attributes within the union for each type. Right now you have heap_insert_begin(), and I asked if it was really heap-specific. Right now, it populates a struct based on a static list of arguments, which are what heap uses. If you were to implement a burp_insert_begin(), how would it differ from heap's? With the current API, they'd (have to) be the same, which means either that it should apply to all AMs (or have a "default" implementation that can be overridden by an AM), or that this API assumes that other AMs will want to do exactly what heap does, and fails to allow other AMs to implement optimizations for bulk inserts as claimed. I don't think using a "union" solves the problem, since it can only accommodate core AMs, and not extensions, so I suggested something like reloptions, which have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET toast.autovacuum_enabled). -- Justin
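[Illustration of the reloptions-like idea: purely hypothetical, the option names and the DefElem-based interface below are invented for this example and are not part of the posted patches. The point is that a namespaced option list, rather than a fixed argument set, would leave room for extension AMs.]

    /* Hypothetical: namespaced, reloptions-style options for insert_begin. */
    List   *insert_opts = NIL;

    insert_opts = lappend(insert_opts,
                          makeDefElem("heap.multi_insert_max_slots",
                                      (Node *) makeInteger(1000), -1));
    insert_opts = lappend(insert_opts,
                          makeDefElem("heap.multi_insert_max_size",
                                      (Node *) makeInteger(65535), -1));

    istate = table_insert_begin(rel, mycid, ti_options, insert_opts);
    /* An AM (or an extension's AM) would ignore namespaces it doesn't know. */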
On Fri, Dec 18, 2020 at 11:24 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote: > > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > Are you thinking that TableInsertState would eventually have additional > > > attributes which would apply to other tableams, but not to heap ? Is > > > heap_insert_begin() really specific to heap ? It's allocating and populating a > > > structure based on its arguments, but those same arguments would be passed to > > > every other AM's insert_begin routine, too. Do you need a more flexible data > > > structure, something that would also accomodate extensions? I'm thinking of > > > reloptions as a loose analogy. > > > > I could not think of other tableam attributes now. But +1 to have that > > kind of flexible structure for TableInsertState. So, it can have > > tableam type and attributes within the union for each type. > > Right now you have heap_insert_begin(), and I asked if it was really > heap-specific. Right now, it populates a struct based on a static list of > arguments, which are what heap uses. > > If you were to implement a burp_insert_begin(), how would it differ from > heap's? With the current API, they'd (have to) be the same, which means either > that it should apply to all AMs (or have a "default" implementation that can be > overridden by an AM), or that this API assumes that other AMs will want to do > exactly what heap does, and fails to allow other AMs to implement optimizations > for bulk inserts as claimed. > > I don't think using a "union" solves the problem, since it can only accommodate > core AMs, and not extensions, so I suggested something like reloptions, which > have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET > toast.autovacuum_enabled). IIUC, your suggestion is to make the heap options such as alloc_bistate(bulk insert state is required or not), mi_max_slots (number of maximum buffered slots/tuples) and mi_max_size (the maximum tuple size of the buffered slots) as reloptions with some default values in reloptions.c under RELOPT_KIND_HEAP category so that they can be modified by users on a per table basis. And likewise other tableam options can be added by the tableam developers. This way, the APIs will become more generic. The tableam developers need to add reloptions of their choice and use them in the new API implementations. Let me know if I am missing anything from what you have in your mind. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 18, 2020 at 11:54:39AM -0600, Justin Pryzby wrote: > On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote: > > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > Are you thinking that TableInsertState would eventually have additional > > > attributes which would apply to other tableams, but not to heap ? Is > > > heap_insert_begin() really specific to heap ? It's allocating and populating a > > > structure based on its arguments, but those same arguments would be passed to > > > every other AM's insert_begin routine, too. Do you need a more flexible data > > > structure, something that would also accomodate extensions? I'm thinking of > > > reloptions as a loose analogy. > > > > I could not think of other tableam attributes now. But +1 to have that > > kind of flexible structure for TableInsertState. So, it can have > > tableam type and attributes within the union for each type. > > Right now you have heap_insert_begin(), and I asked if it was really > heap-specific. Right now, it populates a struct based on a static list of > arguments, which are what heap uses. > > If you were to implement a burp_insert_begin(), how would it differ from > heap's? With the current API, they'd (have to) be the same, which means either > that it should apply to all AMs (or have a "default" implementation that can be > overridden by an AM), or that this API assumes that other AMs will want to do > exactly what heap does, and fails to allow other AMs to implement optimizations > for bulk inserts as claimed. > > I don't think using a "union" solves the problem, since it can only accommodate > core AMs, and not extensions, so I suggested something like reloptions, which > have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET > toast.autovacuum_enabled). I think you'd want to handle things like: - a compressed AM wants to specify a threshold for a tuple's *compressed* size (maybe in addition to the uncompressed size); - a "columnar" AM wants to specify a threshold size for a column, rather than for each tuple; I'm not proposing to handle those specific parameters, but rather pointing out that your implementation doesn't allow handling AM-specific considerations, which I think was the goal. The TableInsertState structure would need to store those, and then the AM's multi_insert_v2 routine would need to make use of them. It feels a bit like we'd introduce the idea of an "AM option", except that it wouldn't be user-facing (or maybe some of them would be?). Maybe I've misunderstood though, so other opinions are welcome. -- Justin
On Mon, Dec 21, 2020 at 1:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Fri, Dec 18, 2020 at 11:54:39AM -0600, Justin Pryzby wrote: > > On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote: > > > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > Are you thinking that TableInsertState would eventually have additional > > > > attributes which would apply to other tableams, but not to heap ? Is > > > > heap_insert_begin() really specific to heap ? It's allocating and populating a > > > > structure based on its arguments, but those same arguments would be passed to > > > > every other AM's insert_begin routine, too. Do you need a more flexible data > > > > structure, something that would also accomodate extensions? I'm thinking of > > > > reloptions as a loose analogy. > > > > > > I could not think of other tableam attributes now. But +1 to have that > > > kind of flexible structure for TableInsertState. So, it can have > > > tableam type and attributes within the union for each type. > > > > Right now you have heap_insert_begin(), and I asked if it was really > > heap-specific. Right now, it populates a struct based on a static list of > > arguments, which are what heap uses. > > > > If you were to implement a burp_insert_begin(), how would it differ from > > heap's? With the current API, they'd (have to) be the same, which means either > > that it should apply to all AMs (or have a "default" implementation that can be > > overridden by an AM), or that this API assumes that other AMs will want to do > > exactly what heap does, and fails to allow other AMs to implement optimizations > > for bulk inserts as claimed. > > > > I don't think using a "union" solves the problem, since it can only accommodate > > core AMs, and not extensions, so I suggested something like reloptions, which > > have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET > > toast.autovacuum_enabled). > > I think you'd want to handle things like: > > - a compressed AM wants to specify a threshold for a tuple's *compressed* size > (maybe in addition to the uncompressed size); > - a "columnar" AM wants to specify a threshold size for a column, rather > than for each tuple; > > I'm not proposing to handle those specific parameters, but rather pointing out > that your implementation doesn't allow handling AM-specific considerations, > which I think was the goal. > > The TableInsertState structure would need to store those, and then the AM's > multi_insert_v2 routine would need to make use of them. > > It feels a bit like we'd introduce the idea of an "AM option", except that it > wouldn't be user-facing (or maybe some of them would be?). Maybe I've > misunderstood though, so other opinions are welcome. Attaching a v2 patch for the new table AMs. This patch has following changes: 1) Made the TableInsertState structure generic by having a void pointer for multi insert state and defined the heap specific multi insert state information in heapam.h. This way each AM can have it's own multi insert state structure and dereference the void pointer using that structure inside the respective AM implementations. 2) Earlier in the v1 patch, the bulk insert state allocation/deallocation was moved to AM level, but I see that there's nothing specific in doing so and I think it should be independent of AM. So I'm doing that in table_insert_begin() and table_insert_end(). 
Because of this, I had to move the BulkInsert function declarations from heapam.h to tableam.h 3) Corrected the typos and tried to adjust indentation of the code. Note that I have not yet made the multi_insert_v2 API optional as suggested earlier. I will think more on this and update. I'm not posting the updated 0002 to 0004 patches, I plan to do so after a couple of reviews happen on the design of the APIs in 0001. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
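[For readers who have not opened the attachment, the shape described in 1) above is roughly the following, paraphrased from the discussion in this thread; the attached v2 patch is authoritative and the field names may differ slightly.]

    /* Rough shape of the v2 structures as described above (paraphrased). */
    typedef struct TableInsertState
    {
        Relation        rel;            /* target relation */
        CommandId       cid;
        int             options;        /* TABLE_INSERT_* flags */
        BulkInsertState bistate;        /* AM-independent bulk insert state, or NULL */
        bool            clear_mi_slots; /* should flush clear the buffered slots? */
        bool            flushed;        /* was the last batch flushed? */
        void           *mistate;        /* AM-specific multi insert state, or NULL */
    } TableInsertState;

    /* Heap's private multi insert state, defined in heapam.h per the v2 patch. */
    typedef struct HeapMultiInsertState
    {
        MemoryContext    context;       /* short-lived context for a batch */
        TupleTableSlot **slots;         /* buffered slots */
        int              max_slots;     /* buffer capacity */
        int              cur_slots;     /* slots currently buffered */
        Size             max_size;      /* flush threshold on total tuple size */
        Size             cur_size;      /* size buffered so far */
    } HeapMultiInsertState;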
Attachment
On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
> I'm not posting the updated 0002 to 0004 patches, I plan to do so
> after a couple of reviews happen on the design of the APIs in 0001.
>
> Thoughts?

Are you familiar with this work?

https://commitfest.postgresql.org/31/2717/
Reloptions for table access methods

It seems like that can be relevant for your patch, and I think some of what your patch needs might be provided by AM opts.

It's difficult to generalize AMs when we have only one, but your use-case might be a concrete example which would help to answer some questions on the other thread.

@Jeff: https://commitfest.postgresql.org/31/2871/

--
Justin
On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
> > I'm not posting the updated 0002 to 0004 patches, I plan to do so
> > after a couple of reviews happen on the design of the APIs in 0001.
> >
> > Thoughts?
>
> Are you familiar with this work ?
>
> https://commitfest.postgresql.org/31/2717/
> Reloptions for table access methods
>
> It seems like that can be relevant for your patch, and I think some of what
> your patch needs might be provided by AM opts.
>
> It's difficult to generalize AMs when we have only one, but your use-case might
> be a concrete example which would help to answer some questions on the other
> thread.
>
> @Jeff: https://commitfest.postgresql.org/31/2871/

Note that I have not gone through the entire thread at [1]. From some initial study, that patch proposes to allow different table AMs to have custom rel options.

The v2 patch that I sent upthread [2] for the new table AMs moves the heap AM multi insert code inside the new heap AM implementation, and I don't see any need for rel options there. In case any other AMs want to control their multi insert API implementation via rel options, I think the proposal at [1] can be useful.

IIUC, there's no dependency as such between the new table AM patch and the rel options thread [1]. If I'm right, can this new table AM patch [2] be reviewed further? Thoughts?

[1] - https://commitfest.postgresql.org/31/2717/
[2] - https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On 28-12-2020 13:48, Bharath Rupireddy wrote: > On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby <pryzby@telsasoft.com> wrote: >> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote: >>> I'm not posting the updated 0002 to 0004 patches, I plan to do so >>> after a couple of reviews happen on the design of the APIs in 0001. >>> >>> Thoughts? >> >> Are you familiar with this work ? >> >> https://commitfest.postgresql.org/31/2717/ >> Reloptions for table access methods >> >> It seems like that can be relevant for your patch, and I think some of what >> your patch needs might be provided by AM opts. >> >> It's difficult to generalize AMs when we have only one, but your use-case might >> be a concrete example which would help to answer some questions on the other >> thread. >> >> @Jeff: https://commitfest.postgresql.org/31/2871/ > > Note that I have not gone through the entire thread at [1]. On some > initial study, that patch is proposing to allow different table AMs to > have custom rel options. > > In the v2 patch that I sent upthread [2] for new table AMs has heap AM > multi insert code moved inside the new heap AM implementation and I > don't see any need of having rel options. In case, any other AMs want > to have the control for their multi insert API implementation via rel > options, I think the proposal at [1] can be useful. > > > Thoughts? > > [1] - https://commitfest.postgresql.org/31/2717/ > [2] - https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Hi, > IIUC, there's no dependency or anything as such for the new table AM > patch with the rel options thread [1]. If I'm right, can this new > table AM patch [2] be reviewed further? To me this seems good enough. Reason is that I anticipate that there would not necessarily be per-table options for now but rather global options, if any. Moreover, if we want to make these kind of tradeoffs user-controllable I would argue this should be done in a different patch-set either way. Reason is that there are parameters in heap already that are computed / hardcoded as well (see e.g. RelationAddExtraBlocks). === As to the patches themselves: I think the API is a huge step forward! I assume that we want to have a single-insert API like heap_insert_v2 so that we can encode the knowledge that there will just be a single insert coming and likely a commit afterwards? Reason I'm asking is that I quite liked the heap_insert_begin parameter is_multi, which could even be turned into a "expected_rowcount" of the amount of rows expected to be commited in the transaction (e.g. single, several, thousands/stream). If we were to make the API based on expected rowcounts, the whole heap_insert_v2, heap_insert and heap_multi_insert could be turned into a single function heap_insert, as the knowledge about buffering of the slots is then already stored in the TableInsertState, creating an API like: // expectedRows: -1 = streaming, otherwise expected rowcount. TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int options, int expectedRows); heap_insert(TableInsertState *state, TupleTableSlot *slot); Do you think that's a good idea? Two smaller things I'm wondering: - the clear_mi_slots; why is this not in the HeapMultiInsertState? the slots themselves are declared there? 
also, the boolean themselves is somewhat problematic I think because it would only work if you specified is_multi=true which would depend on the actual tableam to implement this then in a way that copy/ctas/etc can also use the slot properly, which I think would severely limit their freedom to store the slots more efficiently? Also, why do we want to do ExecClearTuple() anyway? Isn't it good enough that the next call to ExecCopySlot will effectively clear it out? - flushed -> why is this a stored boolean? isn't this indirectly encoded by cur_slots/cur_size == 0? For patches 02-04 I quickly skimmed through them as I assume we first want the API agreed upon. Generally they look nice and like a big step forward. What I'm just wondering about is the usage of the implementation details like mistate->slots[X]. It makes a lot of sense to do so but also makes for a difficult compromise, because now the tableam has to guarantee a copy of the slot, and hopefully even one in a somewhat efficient form. Kind regards, Luc
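[Illustration: a minimal sketch of what Luc's expected-rowcount variant could look like inside the heap AM. HEAP_MULTI_INSERT_MIN_ROWS and the helper functions are invented names for illustration only; this is not code from any posted patch.]

    TableInsertState *
    heap_insert_begin(Relation rel, CommandId cid, int options, int expectedRows)
    {
        TableInsertState *state = palloc0(sizeof(TableInsertState));

        state->rel = rel;
        state->cid = cid;
        state->options = options;

        /* Buffer tuples only when enough rows are expected to make it worthwhile. */
        if (expectedRows < 0 || expectedRows >= HEAP_MULTI_INSERT_MIN_ROWS)
            state->mistate = heap_alloc_multi_insert_state(rel);

        return state;
    }

    void
    heap_insert(TableInsertState *state, TupleTableSlot *slot)
    {
        if (state->mistate != NULL)
            heap_buffer_and_maybe_flush(state, slot);   /* multi insert path */
        else
            heap_do_single_insert(state, slot);         /* plain single insert path */
    }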
On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc@swarm64.com> wrote:
> > table AM patch [2] be reviewed further?
> As to the patches themselves:
>
> I think the API is a huge step forward! I assume that we want to have a
> single-insert API like heap_insert_v2 so that we can encode the
> knowledge that there will just be a single insert coming and likely a
> commit afterwards?
>
> Reason I'm asking is that I quite liked the heap_insert_begin parameter
> is_multi, which could even be turned into a "expected_rowcount" of the
> amount of rows expected to be commited in the transaction (e.g. single,
> several, thousands/stream).
> If we were to make the API based on expected rowcounts, the whole
> heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
> single function heap_insert, as the knowledge about buffering of the
> slots is then already stored in the TableInsertState, creating an API like:
>
> // expectedRows: -1 = streaming, otherwise expected rowcount.
> TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
> options, int expectedRows);
> heap_insert(TableInsertState *state, TupleTableSlot *slot);
>
> Do you think that's a good idea?
IIUC, your suggestion is to use expectedRows and move the multi insert implementation from heap_multi_insert_v2 into heap_insert_v2. If that's correct, heap_insert_v2 will look something like this:
heap_insert_v2()
{
    if (single_insert)
        //do single insertion work, the code in existing heap_insert_v2 comes here
    else
        //do multi insertion work, the code in existing heap_multi_insert_v2 comes here
}
I don't see any problem in combining the single and multi insert APIs into one. Having said that, will the APIs be cleaner then? Isn't it going to be confusing if a single heap_insert_v2 API does both jobs? With the existing separate APIs, the call sequence for single insertion is begin, insert_v2, end, and for multi inserts it's begin, multi_insert_v2, flush, end. I prefer to have a separate multi insert API so that the code reads more clearly.
Thoughts?
> Two smaller things I'm wondering:
> - the clear_mi_slots; why is this not in the HeapMultiInsertState? the
> slots themselves are declared there?
Firstly, we sometimes need the buffered slots outside the multi_insert API (please have a look at the comments in the TableInsertState structure). And we need to have cleared the previously flushed slots before we start buffering again in heap_multi_insert_v2(). I can remove the clear_mi_slots flag altogether and do as follows: I will not set mistate->cur_slots to 0 in heap_multi_insert_flush after the flush; I will only set state->flushed to true. In heap_multi_insert_v2:
void
heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
    TupleTableSlot *batchslot;
    HeapMultiInsertState *mistate = (HeapMultiInsertState *) state->mistate;
    Size sz;

    Assert(mistate && mistate->slots);

    /* If the slots were flushed previously, clear them before using them again. */
    if (state->flushed)
    {
        int i;

        for (i = 0; i < mistate->cur_slots; i++)
            ExecClearTuple(mistate->slots[i]);

        mistate->cur_slots = 0;
        state->flushed = false;
    }

    if (mistate->slots[mistate->cur_slots] == NULL)
        mistate->slots[mistate->cur_slots] =
            table_slot_create(state->rel, NULL);

    batchslot = mistate->slots[mistate->cur_slots];

    ExecCopySlot(batchslot, slot);

Thoughts?
> Also, why do we want to do ExecClearTuple() anyway? Isn't
> it good enough that the next call to ExecCopySlot will effectively clear
> it out?
For virtual, heap and minimal tuple slots, yes, ExecCopySlot clears the slot before copying. But for buffer heap slots, tts_buffer_heap_copyslot does not always clear the destination slot, see below. If we fall into the else condition, we might get some issues. Also note that once the slot is cleared in ExecClearTuple, it will not be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be false. That is why I'd like to keep the ExecClearTuple as is.
/*
 * If the source slot is of a different kind, or is a buffer slot that has
 * been materialized / is virtual, make a new copy of the tuple. Otherwise
 * make a new reference to the in-buffer tuple.
 */
if (dstslot->tts_ops != srcslot->tts_ops ||
    TTS_SHOULDFREE(srcslot) ||
    !bsrcslot->base.tuple)
{
    MemoryContext oldContext;

    ExecClearTuple(dstslot);
}
else
{
    Assert(BufferIsValid(bsrcslot->buffer));

    tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
                                bsrcslot->buffer, false);
> - flushed -> why is this a stored boolean? isn't this indirectly encoded
> by cur_slots/cur_size == 0?
Note that cur_slots is in HeapMultiInsertState; outside of the new APIs, i.e. in TableInsertState, mistate is a void pointer, so we can't really access cur_slots. I mean, we can access it, but we would need to dereference it based on the tableam kind. Instead of doing all of that, to keep the API cleaner, I chose to have a boolean in TableInsertState which we can see and use outside of the new APIs. Hope that's fine.
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, 2021-01-04 at 08:59 +0100, Luc Vlaming wrote:
> Reason I'm asking is that I quite liked the heap_insert_begin parameter
> is_multi, which could even be turned into a "expected_rowcount" of the
> amount of rows expected to be commited in the transaction (e.g. single,
> several, thousands/stream).

Do you mean "written by the statement" instead of "committed in the transaction"? It doesn't look like the TableInsertState state will survive across statement boundaries.

Though that is an important question to consider. If the premise is that a given custom AM may be much more efficient at bulk inserts than retail inserts (which is reasonable), then it makes sense to handle the case of a transaction with many single-tuple inserts. But keeping insert state across statement boundaries also raises a few potential problems.

Regards,
Jeff Davis
On 05-01-2021 22:28, Jeff Davis wrote:
> On Mon, 2021-01-04 at 08:59 +0100, Luc Vlaming wrote:
>> Reason I'm asking is that I quite liked the heap_insert_begin parameter
>> is_multi, which could even be turned into a "expected_rowcount" of the
>> amount of rows expected to be commited in the transaction (e.g. single,
>> several, thousands/stream).
>
> Do you mean "written by the statement" instead of "committed in the
> transaction"? It doesn't look like the TableInsertState state will
> survive across statement boundaries.
>
> Though that is an important question to consider. If the premise is
> that a given custom AM may be much more efficient at bulk inserts than
> retail inserts (which is reasonable), then it makes sense to handle the
> case of a transaction with many single-tuple inserts. But keeping
> insert state across statement boundaries also raises a few potential
> problems.
>
> Regards,
> Jeff Davis

I did actually mean until the end of the transaction. I know this is currently not possible with the current design, but I think it would be cool to start going that way (even if slightly). Creating some more freedom on how a tableam optimizes inserts, when one syncs to disk, etc. would be good imo. It would allow one to create e.g. a tableam that would not have as high an overhead when doing single statement inserts.

Kind regards,
Luc
On 05-01-2021 11:06, Bharath Rupireddy wrote: > On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc@swarm64.com > <mailto:luc@swarm64.com>> wrote: > > > table AM patch [2] be reviewed further? > > As to the patches themselves: > > > > I think the API is a huge step forward! I assume that we want to have a > > single-insert API like heap_insert_v2 so that we can encode the > > knowledge that there will just be a single insert coming and likely a > > commit afterwards? > > > > Reason I'm asking is that I quite liked the heap_insert_begin parameter > > is_multi, which could even be turned into a "expected_rowcount" of the > > amount of rows expected to be commited in the transaction (e.g. single, > > several, thousands/stream). > > If we were to make the API based on expected rowcounts, the whole > > heap_insert_v2, heap_insert and heap_multi_insert could be turned into a > > single function heap_insert, as the knowledge about buffering of the > > slots is then already stored in the TableInsertState, creating an API > like: > > > > // expectedRows: -1 = streaming, otherwise expected rowcount. > > TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int > > options, int expectedRows); > > heap_insert(TableInsertState *state, TupleTableSlot *slot); > > > > Do you think that's a good idea? > > IIUC, your suggestion is to use expectedRows and move the multi insert > implementation heap_multi_insert_v2 to heap_insert_v2. If that's > correct, so heap_insert_v2 will look something like this: > > heap_insert_v2() > { > if (single_insert) > //do single insertion work, the code in existing heap_insert_v2 > comes here > else > //do multi insertion work, the code in existing > heap_multi_insert_v2 comes here > } > > I don't see any problem in combining single and multi insert APIs into > one. Having said that, will the APIs be cleaner then? Isn't it going to > be confusing if a single heap_insert_v2 API does both the works? With > the existing separate APIs, for single insertion, the sequence of the > API can be like begin, insert_v2, end and for multi inserts it's like > begin, multi_insert_v2, flush, end. I prefer to have a separate multi > insert API so that it will make the code look readable. > > Thoughts? The main reason for me for wanting a single API is that I would like the decision of using single or multi inserts to move to inside the tableam. For e.g. a heap insert we might want to put the threshold at e.g. 100 rows so that the overhead of buffering the tuples is actually compensated. For other tableam this logic might also be quite different, and I think therefore that it shouldn't be e.g. COPY or CTAS deciding whether or not multi inserts should be used. Because otherwise the thing we'll get is that there will be tableams that will ignore this flag and do their own thing anyway. I'd rather have an API that gives all necessary information to the tableam and then make the tableam do "the right thing". Another reason I'm suggesting this API is that I would expect that the begin is called in a different place in the code for the (multiple) inserts than the actual insert statement. To me conceptually the begin and end are like e.g. the executor begin and end: you prepare the inserts with the knowledge you have at that point. I assumed (wrongly?) 
that during the start of the statement one knows best how many rows are coming; and then the actual insertion of the row doesn't have to deal anymore with multi/single inserts, choosing when to buffer or not, because that information has already been given during the initial phase. One of the reasons this is appealing to me is that e.g. in [1] there was discussion on when to switch to a multi insert state, and imo this should be up to the tableam. > > > Two smaller things I'm wondering: > > - the clear_mi_slots; why is this not in the HeapMultiInsertState? the > > slots themselves are declared there? > > Firstly, we need to have the buffered slots sometimes(please have a look > at the comments in TableInsertState structure) outside the multi_insert > API. And we need to have cleared the previously flushed slots before we > start buffering in heap_multi_insert_v2(). I can remove the > clear_mi_slots flag altogether and do as follows: I will not set > mistate->cur_slots to 0 in heap_multi_insert_flush after the flush, I > will only set state->flushed to true. In heap_multi_insert_v2, > > void > heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot) > { > TupleTableSlot *batchslot; > HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate; > Size sz; > > Assert(mistate && mistate->slots); > > * /* if the slots are flushed previously then clear them off before > using them again. */ > if (state->flushed) > { > int i; > > for (i = 0; i < mistate->cur_slots; i++) > ExecClearTuple(mistate->slots[i]); > > mistate->cur_slots = 0; > state->flushed = false > }* > > if (mistate->slots[mistate->cur_slots] == NULL) > mistate->slots[mistate->cur_slots] = > table_slot_create(state->rel, NULL); > > batchslot = mistate->slots[mistate->cur_slots]; > > ExecCopySlot(batchslot, slot); > > Thoughts? From what I can see you can just keep the v2-0001 patch and: - remove the flushed variable alltogether. mistate->cur_slots == 0 encodes this already and the variable is never actually checked on. - call ExecClearTuple just before ExecCopySlot() Which would make the code something like: void heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot) { TupleTableSlot *batchslot; HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate; Size sz; Assert(mistate && mistate->slots); if (mistate->slots[mistate->cur_slots] == NULL) mistate->slots[mistate->cur_slots] = table_slot_create(state->rel, NULL); batchslot = mistate->slots[mistate->cur_slots]; ExecClearTuple(batchslot); ExecCopySlot(batchslot, slot); /* * Calculate the tuple size after the original slot is copied, because the * copied slot type and the tuple size may change. */ sz = GetTupleSize(batchslot, mistate->max_size); Assert(sz > 0); mistate->cur_slots++; mistate->cur_size += sz; if (mistate->cur_slots >= mistate->max_slots || mistate->cur_size >= mistate->max_size) heap_multi_insert_flush(state); } void heap_multi_insert_flush(TableInsertState *state) { HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate; MemoryContext oldcontext; Assert(mistate && mistate->slots && mistate->cur_slots >= 0 && mistate->context); if (mistate->cur_slots == 0) return; oldcontext = MemoryContextSwitchTo(mistate->context); heap_multi_insert(state->rel, mistate->slots, mistate->cur_slots, state->cid, state->options, state->bistate); MemoryContextReset(mistate->context); MemoryContextSwitchTo(oldcontext); /* * Do not clear the slots always. 
Sometimes callers may want the slots for * index insertions or after row trigger executions in which case they have * to clear the tuples before using for the next insert batch. */ if (state->clear_mi_slots) { int i; for (i = 0; i < mistate->cur_slots; i++) ExecClearTuple(mistate->slots[i]); } mistate->cur_slots = 0; mistate->cur_size = 0; } > > > Also, why do we want to do ExecClearTuple() anyway? Isn't > > it good enough that the next call to ExecCopySlot will effectively clear > > it out? > > For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the > slot before copying. But, for buffer heap slots, the > tts_buffer_heap_copyslot does not always clear the destination slot, see > below. If we fall into else condition, we might get some issues. And > also note that, once the slot is cleared in ExecClearTuple, it will not > be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be > false. That is why, let's have ExecClearTuple as is. > I had no idea the buffer heap slot doesn't unconditionally clear out the slot :( So yes lets call it unconditionally ourselves. See also suggestion above. > /* > * If the source slot is of a different kind, or is a buffer slot > that has > * been materialized / is virtual, make a new copy of the tuple. > Otherwise > * make a new reference to the in-buffer tuple. > */ > if (dstslot->tts_ops != srcslot->tts_ops || > TTS_SHOULDFREE(srcslot) || > !bsrcslot->base.tuple) > { > MemoryContext oldContext; > > ExecClearTuple(dstslot); > } > else > { > Assert(BufferIsValid(bsrcslot->buffer)); > > tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple, > bsrcslot->buffer, false); > > > - flushed -> why is this a stored boolean? isn't this indirectly encoded > > by cur_slots/cur_size == 0? > > Note that cur_slots is in HeapMultiInsertState and outside of the new > APIs i.e. in TableInsertState, mistate is a void pointer, and we can't > really access the cur_slots. I mean, we can access but we need to be > dereferencing using the tableam kind. Instead of doing all of that, to > keep the API cleaner, I chose to have a boolean in the TableInsertState > which we can see and use outside of the new APIs. Hope that's fine. > So you mean the flushed variable is actually there to tell the user of the API that they are supposed to call flush before end? Why can't the end call flush itself then? I guess I completely misunderstood the purpose of table_multi_insert_flush being public. I had assumed it is there to from the usage site indicate that now would be a good time to flush, e.g. because of a statement ending or something. I had not understood this is a requirement that its always required to do table_multi_insert_flush + table_insert_end. IMHO I would hide this from the callee, given that you would only really call flush yourself when you immediately after would call end, or are there other cases where one would be required to explicitly call flush? > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com> Kind regards, Luc
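[Illustration: the change Luc is asking about ("why can't the end call flush itself?") could be as small as the following sketch, assuming the v2 structures discussed above. It is not code from any posted patch.]

    /*
     * Sketch: let the end routine flush whatever is still buffered, so
     * callers are not required to call flush explicitly.
     */
    void
    heap_insert_end(TableInsertState *state)
    {
        HeapMultiInsertState *mistate = (HeapMultiInsertState *) state->mistate;

        /* Flush the final partial batch so callers never have to remember it. */
        if (mistate != NULL && mistate->cur_slots > 0)
            heap_multi_insert_flush(state);

        /* ... drop buffered slots, delete the short-lived context, free state ... */
    }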
On Wed, Jan 6, 2021 at 12:56 PM Luc Vlaming <luc@swarm64.com> wrote: > The main reason for me for wanting a single API is that I would like the > decision of using single or multi inserts to move to inside the tableam. > For e.g. a heap insert we might want to put the threshold at e.g. 100 > rows so that the overhead of buffering the tuples is actually > compensated. For other tableam this logic might also be quite different, > and I think therefore that it shouldn't be e.g. COPY or CTAS deciding > whether or not multi inserts should be used. Because otherwise the thing > we'll get is that there will be tableams that will ignore this flag and > do their own thing anyway. I'd rather have an API that gives all > necessary information to the tableam and then make the tableam do "the > right thing". > > Another reason I'm suggesting this API is that I would expect that the > begin is called in a different place in the code for the (multiple) > inserts than the actual insert statement. > To me conceptually the begin and end are like e.g. the executor begin > and end: you prepare the inserts with the knowledge you have at that > point. I assumed (wrongly?) that during the start of the statement one > knows best how many rows are coming; and then the actual insertion of > the row doesn't have to deal anymore with multi/single inserts, choosing > when to buffer or not, because that information has already been given > during the initial phase. One of the reasons this is appealing to me is > that e.g. in [1] there was discussion on when to switch to a multi > insert state, and imo this should be up to the tableam. Agree that whether to go with the multi or single inserts should be completely left to tableam implementation, we, as callers of those API just need to inform whether we expect single or multiple rows, and it should be left to tableam implementation whether to actually go with buffering or single inserts. ISTM that it's an elegant way of making the API generic and abstracting everything from the callers. What I wonder is how can we know in advance the expected row count that we need to pass in to heap_insert_begin()? IIUC, we can not estimate the upcoming rows in COPY, Insert Into Select, or Refresh Mat View or some other insert queries? Of course, we can look at the planner's estimated row count for the selects in COPY, Insert Into Select or Refresh Mat View after the planning, but to me that's not something we can depend on and pass in the row count to the insert APIs. When we don't know the expected row count, why can't we(as callers of the APIs) tell the APIs something like, "I'm intending to perform multi inserts, so if possible and if you have a mechanism to buffer the slots, do it, otherwise insert the tuples one by one, or else do whatever you want to do with the tuples I give it you". So, in case of COPY we can ask the API for multi inserts and call heap_insert_begin() and heap_insert_v2(). Given the above explanation, I still feel bool is_multi would suffice. Thoughts? On dynamically, switching from single to multi inserts, this can be done by heap_insert_v2 itself. The way I think it's possible is that, say we have some threshold row count 1000(can be a macro) after inserting those many tuples, heap_insert_v2 can switch to buffering mode. Thoughts? 
> Which would make the code something like: > > void > heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot) > { > TupleTableSlot *batchslot; > HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate; > Size sz; > > Assert(mistate && mistate->slots); > > if (mistate->slots[mistate->cur_slots] == NULL) > mistate->slots[mistate->cur_slots] = > table_slot_create(state->rel, NULL); > > batchslot = mistate->slots[mistate->cur_slots]; > > ExecClearTuple(batchslot); > ExecCopySlot(batchslot, slot); > > /* > * Calculate the tuple size after the original slot is copied, because the > * copied slot type and the tuple size may change. > */ > sz = GetTupleSize(batchslot, mistate->max_size); > > Assert(sz > 0); > > mistate->cur_slots++; > mistate->cur_size += sz; > > if (mistate->cur_slots >= mistate->max_slots || > mistate->cur_size >= mistate->max_size) > heap_multi_insert_flush(state); > } I think clearing tuples before copying the slot as you suggested may work without the need of clear_slots flag. > > > > Also, why do we want to do ExecClearTuple() anyway? Isn't > > > it good enough that the next call to ExecCopySlot will effectively clear > > > it out? > > > > For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the > > slot before copying. But, for buffer heap slots, the > > tts_buffer_heap_copyslot does not always clear the destination slot, see > > below. If we fall into else condition, we might get some issues. And > > also note that, once the slot is cleared in ExecClearTuple, it will not > > be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be > > false. That is why, let's have ExecClearTuple as is. > > > I had no idea the buffer heap slot doesn't unconditionally clear out the > slot :( So yes lets call it unconditionally ourselves. See also > suggestion above. Yeah, we will clear the tuple slot before copy to be on the safer side. > > /* > > * If the source slot is of a different kind, or is a buffer slot > > that has > > * been materialized / is virtual, make a new copy of the tuple. > > Otherwise > > * make a new reference to the in-buffer tuple. > > */ > > if (dstslot->tts_ops != srcslot->tts_ops || > > TTS_SHOULDFREE(srcslot) || > > !bsrcslot->base.tuple) > > { > > MemoryContext oldContext; > > > > ExecClearTuple(dstslot); > > } > > else > > { > > Assert(BufferIsValid(bsrcslot->buffer)); > > > > tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple, > > bsrcslot->buffer, false); > > > > > - flushed -> why is this a stored boolean? isn't this indirectly encoded > > > by cur_slots/cur_size == 0? > > > > Note that cur_slots is in HeapMultiInsertState and outside of the new > > APIs i.e. in TableInsertState, mistate is a void pointer, and we can't > > really access the cur_slots. I mean, we can access but we need to be > > dereferencing using the tableam kind. Instead of doing all of that, to > > keep the API cleaner, I chose to have a boolean in the TableInsertState > > which we can see and use outside of the new APIs. Hope that's fine. > > > So you mean the flushed variable is actually there to tell the user of > the API that they are supposed to call flush before end? Why can't the > end call flush itself then? I guess I completely misunderstood the > purpose of table_multi_insert_flush being public. I had assumed it is > there to from the usage site indicate that now would be a good time to > flush, e.g. because of a statement ending or something. 
I had not > understood this is a requirement that its always required to do > table_multi_insert_flush + table_insert_end. > IMHO I would hide this from the callee, given that you would only really > call flush yourself when you immediately after would call end, or are > there other cases where one would be required to explicitly call flush?

We need to know outside the multi_insert API whether the buffered slots (in the multi insert case) have been flushed. The reason is that if we have indexes or after row triggers, we currently call ExecInsertIndexTuples or ExecARInsertTriggers on the buffered slots outside the API, in a loop, after the flush.

If we agree on removing the heap_multi_insert_v2 API and embedding that logic inside heap_insert_v2, then we can do this - pass the required information and the functions ExecInsertIndexTuples and ExecARInsertTriggers as callbacks, so that whether heap_insert_v2 chooses single or multi inserts, it can call back these functions with the required information after the flush. We can add the callback and the required information to TableInsertState. But I'm not quite sure how generic we can make ExecInsertIndexTuples and ExecARInsertTriggers as callbacks.

If we don't want to go the callback way, then at least we need to know, via TableInsertState, whether or not heap_insert_v2 has chosen multi inserts and, if yes, the buffered slots array, the number of currently buffered slots, and whether they have been flushed. Eventually, we might then need all of the HeapMultiInsertState info in TableInsertState.

Thoughts?

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
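To illustrate the dynamic-switching idea mentioned above, here is a rough sketch (not part of any posted patch) of heap_insert_v2 moving from plain single inserts to buffered multi inserts once a row-count threshold is crossed. The threshold macro, the ntuples counter in TableInsertState and the heap_multi_insert_init() helper are made-up names used only to show the control flow:

/* Hypothetical sketch of the single-to-multi switch discussed above. */
#define HEAP_INSERT_SWITCH_TO_MULTI_THRESHOLD   1000

void
heap_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
    state->ntuples++;           /* assumed counter, not in the patch */

    /* Stay on plain single inserts until enough rows have been seen. */
    if (state->mistate == NULL &&
        state->ntuples <= HEAP_INSERT_SWITCH_TO_MULTI_THRESHOLD)
    {
        bool        shouldFree;
        HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

        heap_insert(state->rel, tuple, state->cid, state->options,
                    state->bistate);
        if (shouldFree)
            pfree(tuple);
        return;
    }

    /* Threshold crossed: set up buffering lazily, then buffer the tuple. */
    if (state->mistate == NULL)
        heap_multi_insert_init(state);      /* hypothetical helper */

    heap_multi_insert_v2(state, slot);      /* buffers and flushes as needed */
}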
On 06-01-2021 14:06, Bharath Rupireddy wrote: > On Wed, Jan 6, 2021 at 12:56 PM Luc Vlaming <luc@swarm64.com> wrote: >> The main reason for me for wanting a single API is that I would like the >> decision of using single or multi inserts to move to inside the tableam. >> For e.g. a heap insert we might want to put the threshold at e.g. 100 >> rows so that the overhead of buffering the tuples is actually >> compensated. For other tableam this logic might also be quite different, >> and I think therefore that it shouldn't be e.g. COPY or CTAS deciding >> whether or not multi inserts should be used. Because otherwise the thing >> we'll get is that there will be tableams that will ignore this flag and >> do their own thing anyway. I'd rather have an API that gives all >> necessary information to the tableam and then make the tableam do "the >> right thing". >> >> Another reason I'm suggesting this API is that I would expect that the >> begin is called in a different place in the code for the (multiple) >> inserts than the actual insert statement. >> To me conceptually the begin and end are like e.g. the executor begin >> and end: you prepare the inserts with the knowledge you have at that >> point. I assumed (wrongly?) that during the start of the statement one >> knows best how many rows are coming; and then the actual insertion of >> the row doesn't have to deal anymore with multi/single inserts, choosing >> when to buffer or not, because that information has already been given >> during the initial phase. One of the reasons this is appealing to me is >> that e.g. in [1] there was discussion on when to switch to a multi >> insert state, and imo this should be up to the tableam. > > Agree that whether to go with the multi or single inserts should be > completely left to tableam implementation, we, as callers of those API > just need to inform whether we expect single or multiple rows, and it > should be left to tableam implementation whether to actually go with > buffering or single inserts. ISTM that it's an elegant way of making > the API generic and abstracting everything from the callers. What I > wonder is how can we know in advance the expected row count that we > need to pass in to heap_insert_begin()? IIUC, we can not estimate the > upcoming rows in COPY, Insert Into Select, or Refresh Mat View or some > other insert queries? Of course, we can look at the planner's > estimated row count for the selects in COPY, Insert Into Select or > Refresh Mat View after the planning, but to me that's not something we > can depend on and pass in the row count to the insert APIs. > > When we don't know the expected row count, why can't we(as callers of > the APIs) tell the APIs something like, "I'm intending to perform > multi inserts, so if possible and if you have a mechanism to buffer > the slots, do it, otherwise insert the tuples one by one, or else do > whatever you want to do with the tuples I give it you". So, in case of > COPY we can ask the API for multi inserts and call heap_insert_begin() > and heap_insert_v2(). > I thought that when it is available (because of planning) it would be nice to pass it in. If you don't know you could pass in a 1 for doing single inserts, and e.g. -1 or max-int for streaming. The reason I proposed it is so that tableam's have as much knowledge as posisble to do the right thing. is_multi does also work of course but is just somewhat less informative. 
What to me seemed somewhat counterintuitive is that with the proposed API it is possible to say is_multi=true and then still call heap_insert_v2 to do a single insert. > Given the above explanation, I still feel bool is_multi would suffice. > > Thoughts? > > On dynamically, switching from single to multi inserts, this can be > done by heap_insert_v2 itself. The way I think it's possible is that, > say we have some threshold row count 1000(can be a macro) after > inserting those many tuples, heap_insert_v2 can switch to buffering > mode. For that I thought it'd be good to use the expected row count, but yeah dynamically switching also works and might work better if the expected row counts are usually off. > > Thoughts? > >> Which would make the code something like: >> >> void >> heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot) >> { >> TupleTableSlot *batchslot; >> HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate; >> Size sz; >> >> Assert(mistate && mistate->slots); >> >> if (mistate->slots[mistate->cur_slots] == NULL) >> mistate->slots[mistate->cur_slots] = >> table_slot_create(state->rel, NULL); >> >> batchslot = mistate->slots[mistate->cur_slots]; >> >> ExecClearTuple(batchslot); >> ExecCopySlot(batchslot, slot); >> >> /* >> * Calculate the tuple size after the original slot is copied, because the >> * copied slot type and the tuple size may change. >> */ >> sz = GetTupleSize(batchslot, mistate->max_size); >> >> Assert(sz > 0); >> >> mistate->cur_slots++; >> mistate->cur_size += sz; >> >> if (mistate->cur_slots >= mistate->max_slots || >> mistate->cur_size >= mistate->max_size) >> heap_multi_insert_flush(state); >> } > > I think clearing tuples before copying the slot as you suggested may > work without the need of clear_slots flag. ok, cool :) > >> >>> > Also, why do we want to do ExecClearTuple() anyway? Isn't >>> > it good enough that the next call to ExecCopySlot will effectively clear >>> > it out? >>> >>> For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the >>> slot before copying. But, for buffer heap slots, the >>> tts_buffer_heap_copyslot does not always clear the destination slot, see >>> below. If we fall into else condition, we might get some issues. And >>> also note that, once the slot is cleared in ExecClearTuple, it will not >>> be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be >>> false. That is why, let's have ExecClearTuple as is. >>> >> I had no idea the buffer heap slot doesn't unconditionally clear out the >> slot :( So yes lets call it unconditionally ourselves. See also >> suggestion above. > > Yeah, we will clear the tuple slot before copy to be on the safer side. > ok >>> /* >>> * If the source slot is of a different kind, or is a buffer slot >>> that has >>> * been materialized / is virtual, make a new copy of the tuple. >>> Otherwise >>> * make a new reference to the in-buffer tuple. >>> */ >>> if (dstslot->tts_ops != srcslot->tts_ops || >>> TTS_SHOULDFREE(srcslot) || >>> !bsrcslot->base.tuple) >>> { >>> MemoryContext oldContext; >>> >>> ExecClearTuple(dstslot); >>> } >>> else >>> { >>> Assert(BufferIsValid(bsrcslot->buffer)); >>> >>> tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple, >>> bsrcslot->buffer, false); >>> >>> > - flushed -> why is this a stored boolean? isn't this indirectly encoded >>> > by cur_slots/cur_size == 0? >>> >>> Note that cur_slots is in HeapMultiInsertState and outside of the new >>> APIs i.e. 
in TableInsertState, mistate is a void pointer, and we can't >>> really access the cur_slots. I mean, we can access but we need to be >>> dereferencing using the tableam kind. Instead of doing all of that, to >>> keep the API cleaner, I chose to have a boolean in the TableInsertState >>> which we can see and use outside of the new APIs. Hope that's fine. >>> >> So you mean the flushed variable is actually there to tell the user of >> the API that they are supposed to call flush before end? Why can't the >> end call flush itself then? I guess I completely misunderstood the >> purpose of table_multi_insert_flush being public. I had assumed it is >> there to from the usage site indicate that now would be a good time to >> flush, e.g. because of a statement ending or something. I had not >> understood this is a requirement that its always required to do >> table_multi_insert_flush + table_insert_end. >> IMHO I would hide this from the callee, given that you would only really >> call flush yourself when you immediately after would call end, or are >> there other cases where one would be required to explicitly call flush? > > We need to know outside the multi_insert API whether the buffered > slots in case of multi inserts are flushed. Reason is that if we have > indexes or after row triggers, currently we call ExecInsertIndexTuples > or ExecARInsertTriggers on the buffered slots outside the API in a > loop after the flush. > > If we agree on removing heap_multi_insert_v2 API and embed that logic > inside heap_insert_v2, then we can do this - pass the required > information and the functions ExecInsertIndexTuples and > ExecARInsertTriggers as callbacks so that, whether or not > heap_insert_v2 choses single or multi inserts, it can callback these > functions with the required information passed after the flush. We can > add the callback and required information into TableInsertState. But, > I'm not quite sure, we would make ExecInsertIndexTuples and > ExecARInsertTriggers. And in > > If we don't want to go with callback way, then at least we need to > know whether or not heap_insert_v2 has chosen multi inserts, if yes, > the buffered slots array, and the number of current buffered slots, > whether they are flushed or not in the TableInsertState. Then, > eventually, we might need all the HeapMultiInsertState info in the > TableInsertState. > To me the callback API seems cleaner, that on heap_insert_begin we can pass in a callback that is called on every flushed slot, or only on multi-insert flushes. Is there a reason it would only be done for multi-insert flushes or can it be generic? > Thoughts? > > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com > Hi, Replied inline. Kind regards, Luc
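To make the callback idea above concrete, a minimal sketch; none of these names exist in the posted patches, they are assumptions for illustration only:

/* Hypothetical per-flushed-slot callback supplied by the caller at begin time. */
typedef void (*TupleFlushCallback) (TupleTableSlot *slot, void *cb_arg);

/* Assumed additions to TableInsertState:
 *   TupleFlushCallback flush_cb;     set by table_insert_begin(), may be NULL
 *   void              *flush_cb_arg; caller context (EState, ResultRelInfo, ...)
 */

/* Inside heap_multi_insert_flush(), right after heap_multi_insert(): */
if (state->flush_cb != NULL)
{
    for (int i = 0; i < mistate->cur_slots; i++)
        state->flush_cb(mistate->slots[i], state->flush_cb_arg);
}

The caller's callback would then be the natural place to run ExecInsertIndexTuples() and ExecARInsertTriggers(), so the AM never needs to know about executor details.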
> If we agree on removing heap_multi_insert_v2 API and embed that logic > inside heap_insert_v2, then we can do this - pass the required > information and the functions ExecInsertIndexTuples and > ExecARInsertTriggers as callbacks so that, whether or not > heap_insert_v2 choses single or multi inserts, it can callback these > functions with the required information passed after the flush. We > can > add the callback and required information into TableInsertState. But, > I'm not quite sure, we would make ExecInsertIndexTuples and > ExecARInsertTriggers. How should the API interact with INSERT INTO ... SELECT? Right now it doesn't appear to be integrated at all, but that seems like a fairly important path for bulk inserts. Regards, Jeff Davis
On 17-01-2021 00:04, Jeff Davis wrote: > >> If we agree on removing heap_multi_insert_v2 API and embed that logic >> inside heap_insert_v2, then we can do this - pass the required >> information and the functions ExecInsertIndexTuples and >> ExecARInsertTriggers as callbacks so that, whether or not >> heap_insert_v2 choses single or multi inserts, it can callback these >> functions with the required information passed after the flush. We >> can >> add the callback and required information into TableInsertState. But, >> I'm not quite sure, we would make ExecInsertIndexTuples and >> ExecARInsertTriggers. > > How should the API interact with INSERT INTO ... SELECT? Right now it > doesn't appear to be integrated at all, but that seems like a fairly > important path for bulk inserts. > > Regards, > Jeff Davis > >

Hi,

Do you mean how it could work, given that the table modification path uses table_tuple_insert_speculative? Just wondering whether you think it generally cannot work, or whether you would like to see that path (or more paths) integrated into the patch.

Kind regards, Luc
On Mon, 2021-01-18 at 08:58 +0100, Luc Vlaming wrote: > You mean how it could because of that the table modification API > uses > the table_tuple_insert_speculative ? Just wondering if you think if > it > generally cannot work or would like to see that path / more paths > integrated in to the patch. I think the patch should support INSERT INTO ... SELECT, and it will be easier to tell if we have the right API when that's integrated. Regards, Jeff Davis
Hi, I addressed the following review comments and attaching v3 patch set. 1) ExecClearTuple happens before ExecCopySlot in heap_multi_insert_v2 and this allowed us to remove clear_mi_slots flag from TableInsertState. 2) I retained the flushed variable inside TableInsertState so that the callers can know whether the buffered slots have been flushed. If yes, the callers can execute after insert row triggers or perform index insertions. This is easier than passing the after insert row triggers info and index info to new multi insert table am and let it do. This way the functionalities can be kept separate i.e. multi insert ams do only buffering, decisions on when to flush, insertions and the callers will execute triggers or index insertions. And also none of the existing table ams are performing these operations within them, so this is inline with the current design of the table ams. 3) I have kept the single and multi insert API separate. The previous suggestion was to have only a single insert API and let the callers provide initially whether they want multi or single inserts. One problem with that approach is that we have to allow table ams to execute the after row triggers or index insertions. That is something I personally don't like. 0001 - new table ams implementation 0002 - the new multi table ams used in CREATE TABLE AS and REFRESH MATERIALIZED VIEW 0003 - the new multi table ams used in COPY Please review the v3 patch set further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
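As a rough illustration of that division of labour, a COPY-like caller would look roughly like this. The field names (flushed, mi_slots, mi_cur_slots) are as I understand them from the v3 patch, while istate, myslot, estate and resultRelInfo are assumed surrounding executor variables; this is a sketch, not code from the patch:

/* Sketch only: caller-side handling after each buffered insert. */
table_multi_insert_v2(istate, myslot);

if (istate->flushed)
{
    for (int i = 0; i < istate->mi_cur_slots; i++)
    {
        TupleTableSlot *flushedslot = istate->mi_slots[i];
        List           *recheckIndexes = NIL;

        /*
         * Run ExecInsertIndexTuples() here for the relation's indexes,
         * just as copyfrom.c does for its own buffer (argument list
         * omitted because it differs across branches).
         */

        if (resultRelInfo->ri_TrigDesc != NULL &&
            resultRelInfo->ri_TrigDesc->trig_insert_after_row)
            ExecARInsertTriggers(estate, resultRelInfo, flushedslot,
                                 recheckIndexes, NULL);
    }
}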
On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Attaching the v4 patch set. Please review it further. Attaching v5 patch set after rebasing onto the latest master. Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Mon, Apr 5, 2021 at 9:49 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Attaching the v4 patch set. Please review it further. > > Attaching v5 patch set after rebasing onto the latest master. Another rebase due to conflicts in 0003. Attaching v6 for review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
On Mon, 19 Apr 2021 at 06:52, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Apr 5, 2021 at 9:49 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Wed, Mar 10, 2021 at 10:21 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > Attaching the v4 patch set. Please review it further. > > > > Attaching v5 patch set after rebasing onto the latest master. > > Another rebase due to conflicts in 0003. Attaching v6 for review. I recently touched the topic of multi_insert, and I remembered this patch. I had to dig a bit to find it, but as it's still open I've added some comments: > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > +#define MAX_BUFFERED_TUPLES 1000 > +#define MAX_BUFFERED_BYTES 65535 It looks like these values were copied over from copyfrom.c, but are now expressed in the context of the heapam. As these values are now heap-specific (as opposed to the TableAM-independent COPY infrastructure), shouldn't we instead optimize for heap page insertions? That is, I suggest using a multiple (1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a similar multiple of BLCKSZ for MAX_BUFFERED_BYTES. > TableInsertState->flushed > TableInsertState->mi_slots I don't quite like the current storage-and-feedback mechanism for flushed tuples. The current assumptions in this mechanism seem to be that 1.) access methods always want to flush all available tuples at once, 2.) access methods want to maintain the TupleTableSlots for all inserted tuples that have not yet had all triggers handled, and 3.) we need access to the not-yet-flushed TupleTableSlots. I think that that is not a correct set of assumptions; I think that only flushed tuples need to be visible to the tableam-agnostic code; and that tableams should be allowed to choose which tuples to flush at which point; as long as they're all flushed after a final multi_insert_flush. Examples: A heap-based access method might want bin-pack tuples and write out full pages only; and thus keep some tuples in the buffers as they didn't fill a page; thus having flushed only a subset of the current buffered tuples. A columnstore-based access method might not want to maintain the TupleTableSlots of original tuples, but instead use temporary columnar storage: TupleTableSlots are quite large when working with huge amounts of tuples; and keeping lots of tuple data in memory is expensive. As such, I think that this should be replaced with a TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len, managed by the tableAM, in which only the flushed tuples are visible to the AM-agnostic code. This is not much different from how the implementation currently works; except that ->mi_slots now does not expose unflushed tuples; and that ->flushed is replaced by an integer value of number of flushed tuples. A further improvement (in my opinion) would be the change from a single multi_insert_flush() to a signalling-based multi_insert_flush: It is not unreasonable for e.g. a columnstore to buffer tens of thousands of inserts; but doing so in TupleTableSlots would introduce a high memory usage. Allowing for batched retrieval of flushed tuples would help in memory usage; which is why multiple calls to multi_insert_flush() could be useful. 
To handle this gracefully, we'd probably add TIS->mi_flush_remaining, where each insert adds one to mi_flush_remaining; and each time mi_flushed_slots has been handled mi_flush_remaining is decreased by mi_flushed_len by the handler code. Once we're done inserting into the table, we keep calling multi_insert_flush until no more tuples are being flushed (and error out if we're still waiting for flushes but no new flushed tuples are returned). - Matthias
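A sketch of what the two suggestions above could look like; the multiplier 4 and every field name below are illustrative assumptions, not patch contents:

/* Heap-specific buffering limits, expressed in heap terms. */
#define HEAP_MULTI_INSERT_MAX_BUFFERED_TUPLES   (MaxHeapTuplesPerPage * 4)
#define HEAP_MULTI_INSERT_MAX_BUFFERED_BYTES    (BLCKSZ * 4)

typedef struct TableInsertState
{
    Relation        rel;
    CommandId       cid;
    int             options;

    /* Only tuples the AM has actually written out are exposed here. */
    TupleTableSlot **mi_flushed_slots;
    int             mi_flushed_len;

    /* AM-private state; any still-unflushed buffer lives in here. */
    void           *mistate;
} TableInsertState;

With that split, heapam could keep bin-packing tuples toward full pages internally, while the AM-agnostic caller only ever iterates over mi_flushed_slots.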
On Fri, Mar 4, 2022 at 8:07 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > Another rebase due to conflicts in 0003. Attaching v6 for review. > > I recently touched the topic of multi_insert, and I remembered this > patch. I had to dig a bit to find it, but as it's still open I've > added some comments: Thanks for reviving the thread. I almost lost hope in it. In fact, it took me a while to recollect the work and respond to your comments. I'm now happy to answer or continue working on this patch if you or someone is really interested to review it and take it to the end. > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > > +#define MAX_BUFFERED_TUPLES 1000 > > +#define MAX_BUFFERED_BYTES 65535 > > It looks like these values were copied over from copyfrom.c, but are > now expressed in the context of the heapam. > As these values are now heap-specific (as opposed to the > TableAM-independent COPY infrastructure), shouldn't we instead > optimize for heap page insertions? That is, I suggest using a multiple > (1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a > similar multiple of BLCKSZ for MAX_BUFFERED_BYTES. We can do that. In fact, it is a good idea to let callers pass in as an input to tuple_insert_begin and have it as part of TableInsertState. If okay, I will do that in the next patch. > > TableInsertState->flushed > > TableInsertState->mi_slots > > I don't quite like the current storage-and-feedback mechanism for > flushed tuples. The current assumptions in this mechanism seem to be > that > 1.) access methods always want to flush all available tuples at once, > 2.) access methods want to maintain the TupleTableSlots for all > inserted tuples that have not yet had all triggers handled, and > 3.) we need access to the not-yet-flushed TupleTableSlots. > > I think that that is not a correct set of assumptions; I think that > only flushed tuples need to be visible to the tableam-agnostic code; > and that tableams should be allowed to choose which tuples to flush at > which point; as long as they're all flushed after a final > multi_insert_flush. > > Examples: > A heap-based access method might want bin-pack tuples and write out > full pages only; and thus keep some tuples in the buffers as they > didn't fill a page; thus having flushed only a subset of the current > buffered tuples. > A columnstore-based access method might not want to maintain the > TupleTableSlots of original tuples, but instead use temporary columnar > storage: TupleTableSlots are quite large when working with huge > amounts of tuples; and keeping lots of tuple data in memory is > expensive. > > As such, I think that this should be replaced with a > TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len, > managed by the tableAM, in which only the flushed tuples are visible > to the AM-agnostic code. This is not much different from how the > implementation currently works; except that ->mi_slots now does not > expose unflushed tuples; and that ->flushed is replaced by an integer > value of number of flushed tuples. Yeah, that makes sense. Let's table AMs expose the flushed tuples outside on which the callers can handle the after-insert row triggers upon them. IIUC, TableInsertState needs to have few other variables: /* Below members are only used for multi inserts. */ /* Array of buffered slots. */ TupleTableSlot **mi_slots; /* Number of slots that are currently buffered. 
*/
int32 mi_cur_slots;

/* Array of flushed slots that the callers will use to handle after-insert row triggers or similar events outside the AM. */
TupleTableSlot **mi_flushed_slots;

/* Number of slots that have been flushed. */
int32 mi_no_of_flushed_slots;

The implementation of heap_multi_insert_flush will just set mi_slots to mi_flushed_slots.

> A further improvement (in my opinion) would be the change from a > single multi_insert_flush() to a signalling-based multi_insert_flush: > It is not unreasonable for e.g. a columnstore to buffer tens of > thousands of inserts; but doing so in TupleTableSlots would introduce > a high memory usage. Allowing for batched retrieval of flushed tuples > would help in memory usage; which is why multiple calls to > multi_insert_flush() could be useful. To handle this gracefully, we'd > probably add TIS->mi_flush_remaining, where each insert adds one to > mi_flush_remaining; and each time mi_flushed_slots has been handled > mi_flush_remaining is decreased by mi_flushed_len by the handler code. > Once we're done inserting into the table, we keep calling > multi_insert_flush until no more tuples are being flushed (and error > out if we're still waiting for flushes but no new flushed tuples are > returned).

The current approach is signalling-based, right? In heap_multi_insert_v2:

    if (state->mi_cur_slots >= mistate->max_slots ||
        mistate->cur_size >= mistate->max_size)
        heap_multi_insert_flush(state);

The table_multi_insert_v2 AM implementers will have to carefully choose a buffering strategy, i.e. the number of tuples and the size to buffer, and decide correctly without driving up memory usage.

Thoughts?

Regards, Bharath Rupireddy.
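For what it's worth, a minimal sketch of how heap_multi_insert_flush could hand the flushed batch back through those fields (field names as proposed above, so still assumptions; the memory context handling from the patch is omitted here):

static void
heap_multi_insert_flush(TableInsertState *state)
{
    HeapMultiInsertState *mistate = (HeapMultiInsertState *) state->mistate;

    if (mistate->cur_slots == 0)
    {
        state->mi_no_of_flushed_slots = 0;
        return;
    }

    heap_multi_insert(state->rel, mistate->slots, mistate->cur_slots,
                      state->cid, state->options, state->bistate);

    /* Expose exactly what was written out to the AM-agnostic caller. */
    state->mi_flushed_slots = mistate->slots;
    state->mi_no_of_flushed_slots = mistate->cur_slots;

    mistate->cur_slots = 0;
    mistate->cur_size = 0;
}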
On Sun, 6 Mar 2022 at 12:12, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Mar 4, 2022 at 8:07 PM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > > Another rebase due to conflicts in 0003. Attaching v6 for review. > > > > I recently touched the topic of multi_insert, and I remembered this > > patch. I had to dig a bit to find it, but as it's still open I've > > added some comments: > > Thanks for reviving the thread. I almost lost hope in it. In fact, it > took me a while to recollect the work and respond to your comments. > I'm now happy to answer or continue working on this patch if you or > someone is really interested to review it and take it to the end. > > > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > > > +#define MAX_BUFFERED_TUPLES 1000 > > > +#define MAX_BUFFERED_BYTES 65535 > > > > It looks like these values were copied over from copyfrom.c, but are > > now expressed in the context of the heapam. > > As these values are now heap-specific (as opposed to the > > TableAM-independent COPY infrastructure), shouldn't we instead > > optimize for heap page insertions? That is, I suggest using a multiple > > (1 or more) of MaxHeapTuplesPerPage for _TUPLES, and that same / a > > similar multiple of BLCKSZ for MAX_BUFFERED_BYTES. > > We can do that. In fact, it is a good idea to let callers pass in as > an input to tuple_insert_begin and have it as part of > TableInsertState. If okay, I will do that in the next patch. > > > > TableInsertState->flushed > > > TableInsertState->mi_slots > > > > I don't quite like the current storage-and-feedback mechanism for > > flushed tuples. The current assumptions in this mechanism seem to be > > that > > 1.) access methods always want to flush all available tuples at once, > > 2.) access methods want to maintain the TupleTableSlots for all > > inserted tuples that have not yet had all triggers handled, and > > 3.) we need access to the not-yet-flushed TupleTableSlots. > > > > I think that that is not a correct set of assumptions; I think that > > only flushed tuples need to be visible to the tableam-agnostic code; > > and that tableams should be allowed to choose which tuples to flush at > > which point; as long as they're all flushed after a final > > multi_insert_flush. > > > > Examples: > > A heap-based access method might want bin-pack tuples and write out > > full pages only; and thus keep some tuples in the buffers as they > > didn't fill a page; thus having flushed only a subset of the current > > buffered tuples. > > A columnstore-based access method might not want to maintain the > > TupleTableSlots of original tuples, but instead use temporary columnar > > storage: TupleTableSlots are quite large when working with huge > > amounts of tuples; and keeping lots of tuple data in memory is > > expensive. > > > > As such, I think that this should be replaced with a > > TableInsertState->mi_flushed_slots + TableInsertState->mi_flushed_len, > > managed by the tableAM, in which only the flushed tuples are visible > > to the AM-agnostic code. This is not much different from how the > > implementation currently works; except that ->mi_slots now does not > > expose unflushed tuples; and that ->flushed is replaced by an integer > > value of number of flushed tuples. > > Yeah, that makes sense. Let's table AMs expose the flushed tuples > outside on which the callers can handle the after-insert row triggers > upon them. 
> > IIUC, TableInsertState needs to have few other variables: > > /* Below members are only used for multi inserts. */ > /* Array of buffered slots. */ > TupleTableSlot **mi_slots; Not quite: there's no need for TupleTableSlot **mi_slots in the TableInsertState; as the buffer used by the tableAM to buffer unflushed tuples shouldn't be publicly visible. I suspect that moving that field to HeapMultiInsertState instead would be the prudent thing to do; limiting the external access of AM-specific buffers. > /* Number of slots that are currently buffered. */ > int32 mi_cur_slots; > /* Array of flushed slots that will be used by callers to handle > after-insert row triggers or similar events outside */ > TupleTableSlot **mi_flushed_slots ; > /* Number of slots that are currently buffered. */ > int32 mi_no_of_flushed_slots; > > The implementation of heap_multi_insert_flush will just set the > mi_slots to mi_flushed_slots. Yes. > > A further improvement (in my opinion) would be the change from a > > single multi_insert_flush() to a signalling-based multi_insert_flush: > > It is not unreasonable for e.g. a columnstore to buffer tens of > > thousands of inserts; but doing so in TupleTableSlots would introduce > > a high memory usage. Allowing for batched retrieval of flushed tuples > > would help in memory usage; which is why multiple calls to > > multi_insert_flush() could be useful. To handle this gracefully, we'd > > probably add TIS->mi_flush_remaining, where each insert adds one to > > mi_flush_remaining; and each time mi_flushed_slots has been handled > > mi_flush_remaining is decreased by mi_flushed_len by the handler code. > > Once we're done inserting into the table, we keep calling > > multi_insert_flush until no more tuples are being flushed (and error > > out if we're still waiting for flushes but no new flushed tuples are > > returned). > > The current approach is signalling-based right? > heap_multi_insert_v2 > if (state->mi_cur_slots >= mistate->max_slots || > mistate->cur_size >= mistate->max_size) > heap_multi_insert_flush(state); That's for the AM-internal flushing; yes. I was thinking about the AM api for flushing that's used when finalizing the batched insert; i.e. table_multi_insert_flush. Currently it assumes that all buffered tuples will be flushed after one call (which is correct for heap), but putting those unflushed tuples all at once back in memory might not be desirable or possible (for e.g. columnar); so we might need to call table_multi_insert_flush until there's no more buffered tuples. > The table_multi_insert_v2 am implementers will have to carefully > choose buffering strategy i.e. number of tuples, size to buffer and > decide rightly without hitting memory usages. Agreed -Matthias
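To illustrate the finalization protocol described above from the caller's side (mi_flushed_slots, mi_flushed_len and handle_flushed_slot() are assumed names, not patch contents):

/* Keep asking the AM to flush until it reports nothing more came out. */
for (;;)
{
    table_multi_insert_flush(istate);

    if (istate->mi_flushed_len == 0)
        break;              /* AM has drained its buffer */

    /* Post-insert work (index tuples, after-row triggers) per batch. */
    for (int i = 0; i < istate->mi_flushed_len; i++)
        handle_flushed_slot(istate->mi_flushed_slots[i]);
}

table_insert_end(istate);

This way a columnar AM could drain a large internal buffer in several small batches instead of having to materialize every buffered tuple as a TupleTableSlot at once.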
On Mon, Mar 07, 2022 at 05:09:23PM +0100, Matthias van de Meent wrote: > That's for the AM-internal flushing; yes. I was thinking about the AM > api for flushing that's used when finalizing the batched insert; i.e. > table_multi_insert_flush. > > Currently it assumes that all buffered tuples will be flushed after > one call (which is correct for heap), but putting those unflushed > tuples all at once back in memory might not be desirable or possible > (for e.g. columnar); so we might need to call table_multi_insert_flush > until there's no more buffered tuples. This thread has been idle for 6 months now, so I have marked it as returned with feedback as of what looks like a lack of activity. I have looked at what's been proposed, and I am not really sure if the direction taken is correct, though there may be a potential gain in consolidating the multi-insert path within the table AM set of callbacks. -- Michael
On Wed, Oct 12, 2022 at 11:01 AM Michael Paquier <michael@paquier.xyz> wrote: > > This thread has been idle for 6 months now, so I have marked it as > returned with feedback as of what looks like a lack of activity. I > have looked at what's been proposed, and I am not really sure if the > direction taken is correct, though there may be a potential gain in > consolidating the multi-insert path within the table AM set of > callbacks. Thanks. Unfortunately, I'm not finding enough cycles to work on this feature. I'm happy to help if others have any further thoughts and take it from here. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, This patch was referenced in a discussion at pgcon, so I thought I'd give it a look, even though Bharat said that he won't have time to drive it forward... On 2021-04-19 10:21:36 +0530, Bharath Rupireddy wrote: > diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c > index bd5faf0c1f..655de8e6b7 100644 > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -2558,6 +2558,11 @@ static const TableAmRoutine heapam_methods = { > .tuple_insert_speculative = heapam_tuple_insert_speculative, > .tuple_complete_speculative = heapam_tuple_complete_speculative, > .multi_insert = heap_multi_insert, > + .tuple_insert_begin = heap_insert_begin, > + .tuple_insert_v2 = heap_insert_v2, > + .multi_insert_v2 = heap_multi_insert_v2, > + .multi_insert_flush = heap_multi_insert_flush, > + .tuple_insert_end = heap_insert_end, > .tuple_delete = heapam_tuple_delete, > .tuple_update = heapam_tuple_update, > .tuple_lock = heapam_tuple_lock, I don't think we should have multiple callback for the insertion APIs in tableam.h. I think it'd be good to continue supporting the old table_*() functions, but supporting multiple insert APIs in each AM doesn't make much sense to me. > +/* > + * GetTupleSize - Compute the tuple size given a table slot. > + * > + * For heap tuple, buffer tuple and minimal tuple slot types return the actual > + * tuple size that exists. For virtual tuple, the size is calculated as the > + * slot does not have the tuple size. If the computed size exceeds the given > + * maxsize for the virtual tuple, this function exits, not investing time in > + * further unnecessary calculation. > + * > + * Important Notes: > + * 1) Size calculation code for virtual slots is being used from > + * tts_virtual_materialize(), hence ensure to have the same changes or fixes > + * here and also there. > + * 2) Currently, GetTupleSize() handles the existing heap, buffer, minimal and > + * virtual slots. Ensure to add related code in case any new slot type is > + * introduced. > + */ > +inline Size > +GetTupleSize(TupleTableSlot *slot, Size maxsize) > +{ > + Size sz = 0; > + HeapTuple tuple = NULL; > + > + if (TTS_IS_HEAPTUPLE(slot)) > + tuple = ((HeapTupleTableSlot *) slot)->tuple; > + else if(TTS_IS_BUFFERTUPLE(slot)) > + tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple; > + else if(TTS_IS_MINIMALTUPLE(slot)) > + tuple = ((MinimalTupleTableSlot *) slot)->tuple; > + else if(TTS_IS_VIRTUAL(slot)) I think this embeds too much knowledge of the set of slot types in core code. I don't see why it's needed either? > diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h > index 414b6b4d57..2a1470a7b6 100644 > --- a/src/include/access/tableam.h > +++ b/src/include/access/tableam.h > @@ -229,6 +229,32 @@ typedef struct TM_IndexDeleteOp > TM_IndexStatus *status; > } TM_IndexDeleteOp; > > +/* Holds table insert state. */ > +typedef struct TableInsertState I suspect we should design it to be usable for updates and deletes in the future, and thus name it TableModifyState. > +{ > + Relation rel; > + /* Bulk insert state if requested, otherwise NULL. */ > + struct BulkInsertStateData *bistate; > + CommandId cid; Hm - I'm not sure it's a good idea to force the cid to be the same for all inserts done via one TableInsertState. > + int options; > + /* Below members are only used for multi inserts. */ > + /* Array of buffered slots. */ > + TupleTableSlot **mi_slots; > + /* Number of slots that are currently buffered. 
*/ > + int32 mi_cur_slots; > + /* > + * Access method specific information such as parameters that are needed > + * for buffering and flushing decisions can go here. > + */ > + void *mistate; I think we should instead have a generic TableModifyState, which each AM then embeds into an AM specific AM state. Forcing two very related structs to be allocated separately doesn't seem wise in this case. > @@ -1430,6 +1473,50 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots, > cid, options, bistate); > } > > +static inline TableInsertState* > +table_insert_begin(Relation rel, CommandId cid, int options, > + bool alloc_bistate, bool is_multi) Why have alloc_bistate and options? > +static inline void > +table_insert_end(TableInsertState *state) > +{ > + /* Deallocate bulk insert state here, since it's AM independent. */ > + if (state->bistate) > + FreeBulkInsertState(state->bistate); > + > + state->rel->rd_tableam->tuple_insert_end(state); > +} Seems like the order in here should be swapped? Greetings, Andres Freund
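A sketch of the embedding suggested above might look like this (all names are illustrative; the point is a single allocation with the generic part at a known offset):

typedef struct TableModifyState
{
    Relation    rel;
    CommandId   cid;
    int         options;
} TableModifyState;

/* heapam's AM-specific state embeds the generic part as its first member. */
typedef struct HeapModifyState
{
    TableModifyState base;          /* must be first */
    BulkInsertState bistate;
    TupleTableSlot **slots;         /* buffered tuples */
    int         nused;
    Size        bufferedBytes;
    MemoryContext context;
} HeapModifyState;

/* AM-agnostic code passes around the generic pointer; heapam casts it back. */
TableModifyState *
heap_insert_begin(Relation rel, CommandId cid, int options)
{
    HeapModifyState *hstate = palloc0(sizeof(HeapModifyState));

    hstate->base.rel = rel;
    hstate->base.cid = cid;
    hstate->base.options = options;

    return &hstate->base;
}

This avoids allocating the AM-specific mistate separately while still letting callers hold a plain TableModifyState pointer.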
On Sun, Jun 4, 2023 at 4:08 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > This patch was referenced in a discussion at pgcon, so I thought I'd give it a > look, even though Bharat said that he won't have time to drive it forward... Thanks. I'm glad to know that the feature was discussed at PGCon. If there's an interest, I'm happy to spend time again on it. I'll look into the review comments and respond soon. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sun, Jun 4, 2023 at 4:08 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > This patch was referenced in a discussion at pgcon, so I thought I'd give it a > look, even though Bharat said that he won't have time to drive it forward... Thanks. Finally, I started to spend time on this. Just curious - may I know the discussion in/for which this patch is referenced? What was the motive? Is it captured somewhere? > On 2021-04-19 10:21:36 +0530, Bharath Rupireddy wrote: > > + .tuple_insert_begin = heap_insert_begin, > > + .tuple_insert_v2 = heap_insert_v2, > > + .multi_insert_v2 = heap_multi_insert_v2, > > + .multi_insert_flush = heap_multi_insert_flush, > > + .tuple_insert_end = heap_insert_end, > > I don't think we should have multiple callback for the insertion APIs in > tableam.h. I think it'd be good to continue supporting the old table_*() > functions, but supporting multiple insert APIs in each AM doesn't make much > sense to me. I named these new functions XXX_v2 for compatibility reasons. Because, it's quite possible for external modules to use existing table_tuple_insert, table_multi_insert functions. If we were to change the existing insert tableams, all the external modules using them would have to change their code, is that okay? > > +/* > > + * GetTupleSize - Compute the tuple size given a table slot. > > +inline Size > > I think this embeds too much knowledge of the set of slot types in core > code. I don't see why it's needed either? The heapam multi-insert implementation needs to know the tuple size from the slot to decide whether or not to flush the tuples from the buffers. I couldn't find a direct way then to know the tuple size from the slot, so added that helper function. With a better understanding now, I think we can rely on the memory allocated for TupleTableSlot's tts_mcxt. While this works for the materialized slots passed in to the insert functions, for non-materialized slots the flushing decision can be solely on the number of tuples stored in the buffers. Another way is to add a get_tuple_size callback to TupleTableSlotOps and let the tuple slot providers give us the tuple size. > > diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h > > index 414b6b4d57..2a1470a7b6 100644 > > --- a/src/include/access/tableam.h > > +++ b/src/include/access/tableam.h > > @@ -229,6 +229,32 @@ typedef struct TM_IndexDeleteOp > > TM_IndexStatus *status; > > } TM_IndexDeleteOp; > > > > +/* Holds table insert state. */ > > +typedef struct TableInsertState > > I suspect we should design it to be usable for updates and deletes in the > future, and thus name it TableModifyState. There are different parameters that insert/update/delete would want to pass across in the state. So, having Table{Insert/Update/Delete}State may be a better idea than having the unneeded variables lying around or having a union and state_type as INSERT/UPDATE/DELETE, no? Do you have a different thought here? > I think we should instead have a generic TableModifyState, which each AM then > embeds into an AM specific AM state. Forcing two very related structs to be > allocated separately doesn't seem wise in this case. The v7 patches have largely changed the way these options and parameters are passed, please have a look. > > +{ > > + Relation rel; > > + /* Bulk insert state if requested, otherwise NULL. */ > > + struct BulkInsertStateData *bistate; > > + CommandId cid; > > Hm - I'm not sure it's a good idea to force the cid to be the same for all > inserts done via one TableInsertState. 
If required, someone can always pass a new CID before every tuple_insert_v2/tuple_multi_insert_v2 call via TableInsertState. Isn't it sufficient? > > @@ -1430,6 +1473,50 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots, > > cid, options, bistate); > > } > > > > +static inline TableInsertState* > > +table_insert_begin(Relation rel, CommandId cid, int options, > > + bool alloc_bistate, bool is_multi) > > Why have alloc_bistate and options? "alloc_bistate" is for the caller to specify if they need a bulk insert state or not. "options" is for the caller to specify if they need table_tuple_insert performance options such as TABLE_INSERT_SKIP_FSM, TABLE_INSERT_FROZEN, TABLE_INSERT_NO_LOGICAL. The v7 patches have changed the way these options and parameters are passed, please have a look. > > +static inline void > > +table_insert_end(TableInsertState *state) > > +{ > > + /* Deallocate bulk insert state here, since it's AM independent. */ > > + if (state->bistate) > > + FreeBulkInsertState(state->bistate); > > + > > + state->rel->rd_tableam->tuple_insert_end(state); > > +} > > Seems like the order in here should be swapped? Right. It looks like BulkInsertState is for heapam, it really doesn't have to be in table_XXX functions, hence it all the way down to heap_insert_XXX functions. I'm attaching the v7 patch set with the above review comments addressed. My initial idea behind these new insert APIs was the ability to re-use the multi insert code in COPY for CTAS and REFRESH MATERIALIZED VIEW. I'm open to more thoughts here. The v7 patches have largely changed the way state structure (heapam specific things are moved all the way down to heapam.c) is defined, the parameters are passed, and simplified the multi insert logic a lot. 0001 - introduces new single and multi insert table AM and heapam implementation of the new AM. 0002 - optimizes CREATE TABLE AS to use the new multi inserts table AM making it faster by 2.13X or 53%. 0003 - optimizes REFRESH MATERIALIZED VIEW to use the new multi inserts table AM making it faster by 1.52X or 34%. 0004 - uses the new multi inserts table AM for COPY FROM - I'm yet to spend time on this, I'll share the patch when ready. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
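Regarding the get_tuple_size idea above, a rough sketch of what is meant; this callback does not exist in TupleTableSlotOps today, it is purely hypothetical:

/* Hypothetical addition to struct TupleTableSlotOps (tuptable.h): */
Size        (*get_tuple_size) (TupleTableSlot *slot);

/*
 * heap's buffering decision could then stay slot-type-agnostic; in
 * heap_multi_insert_v2(), after copying into the buffered slot:
 */
mistate->cur_size += batchslot->tts_ops->get_tuple_size(batchslot);
mistate->cur_slots++;

if (mistate->cur_slots >= mistate->max_slots ||
    mistate->cur_size >= mistate->max_size)
    heap_multi_insert_flush(state);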
On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Thanks. Finally, I started to spend time on this. Just curious - may > I know the discussion in/for which this patch is referenced? What was > the motive? Is it captured somewhere? It may not have been the only place, but we at least touched on it during the unconference: https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs We discussed two related-but-separate ideas: 1) bulk/batch operations and 2) maintenance of TAM state across multiple related operations. --Jacob
On Tue, Aug 1, 2023 at 10:32 PM Jacob Champion <jchampion@timescale.com> wrote: > > On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks. Finally, I started to spend time on this. Just curious - may > > I know the discussion in/for which this patch is referenced? What was > > the motive? Is it captured somewhere? > > It may not have been the only place, but we at least touched on it > during the unconference: > > https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs > > We discussed two related-but-separate ideas: > 1) bulk/batch operations and > 2) maintenance of TAM state across multiple related operations. Thank you. I'm attaching v8 patch-set here which includes use of new insert TAMs for COPY FROM. With this, postgres not only will have the new TAM for inserts, but they also can make the following commands faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in the coming days and post the results here, until then I appreciate any feedback on the patches. I've also added this proposal to CF - https://commitfest.postgresql.org/47/4777/. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Jan 17, 2024 at 10:57 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thank you. I'm attaching v8 patch-set here which includes use of new > insert TAMs for COPY FROM. With this, postgres not only will have the > new TAM for inserts, but they also can make the following commands > faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW, > REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in > the coming days and post the results here, until then I appreciate any > feedback on the patches. > > I've also added this proposal to CF - > https://commitfest.postgresql.org/47/4777/.

Some of the tests related to Incremental Sort added by a recent commit 0452b461bc4 in aggregates.sql are failing when the multi inserts feature is used for CTAS (like done in 0002 patch). I'm not so sure if it's because of the reduction in the CTAS execution times. Execution time for table 'btg' created with CREATE TABLE AS added by commit 0452b461bc4 with single inserts is 25.3 msec, with multi inserts is 17.7 msec. This means that the multi inserts are about 1.43 times or 30.04% faster than the single inserts.

Couple of ways to make these tests pick Incremental Sort as expected - 1) CLUSTER btg USING abc; or 2) increase the number of rows in table btg to 100K from 10K. FWIW, if I reduce the number of rows in the table from 10K to 1K, the Incremental Sort won't get picked on HEAD with CTAS using single inserts. Hence, I chose option (2) to fix the issue.

Please find the attached v9 patch set.

[1]
-- Engage incremental sort
explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w;
-                    QUERY PLAN
--------------------------------------------------
+          QUERY PLAN
+------------------------------
 Group
   Group Key: x, y, z, w
-   ->  Incremental Sort
+   ->  Sort
         Sort Key: x, y, z, w
-        Presorted Key: x, y
-        ->  Index Scan using btg_x_y_idx on btg
-(6 rows)
+   ->  Seq Scan on btg
+(5 rows)

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jan 29, 2024 at 12:57 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 10:57 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Thank you. I'm attaching v8 patch-set here which includes use of new > > insert TAMs for COPY FROM. With this, postgres not only will have the > > new TAM for inserts, but they also can make the following commands > > faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW, > > REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in > > the coming days and post the results here, until then I appreciate any > > feedback on the patches. > > > > I've also added this proposal to CF - > > https://commitfest.postgresql.org/47/4777/. > > Some of the tests related to Incremental Sort added by a recent commit > 0452b461bc4 in aggregates.sql are failing when the multi inserts > feature is used for CTAS (like done in 0002 patch). I'm not so sure if > it's because of the reduction in the CTAS execution times. Execution > time for table 'btg' created with CREATE TABLE AS added by commit > 0452b461bc4 with single inserts is 25.3 msec, with multi inserts is > 17.7 msec. This means that the multi inserts are about 1.43 times or > 30.04% faster than the single inserts. Couple of ways to make these > tests pick Incremental Sort as expected - 1) CLUSTER btg USING abc; or > 2) increase the number of rows in table btg to 100K from 10K. FWIW, if > I reduce the number of rows in the table from 10K to 1K, the > Incremental Sort won't get picked on HEAD with CTAS using single > inserts. Hence, I chose option (2) to fix the issue. > > Please find the attached v9 patch set. > > [1] > -- Engage incremental sort > explain (COSTS OFF) SELECT x,y FROM btg GROUP BY x,y,z,w; > - QUERY PLAN > -------------------------------------------------- > + QUERY PLAN > +------------------------------ > Group > Group Key: x, y, z, w > - -> Incremental Sort > + -> Sort > Sort Key: x, y, z, w > - Presorted Key: x, y > - -> Index Scan using btg_x_y_idx on btg > -(6 rows) > + -> Seq Scan on btg > +(5 rows) CF bot machine with Windows isn't happy with the compilation [1], so fixed those warnings and attached v10 patch set. [1] [07:35:25.458] [632/2212] Compiling C object src/backend/postgres_lib.a.p/commands_copyfrom.c.obj [07:35:25.458] c:\cirrus\src\include\access\tableam.h(1574) : warning C4715: 'table_multi_insert_slots': not all control paths return a value [07:35:25.458] c:\cirrus\src\include\access\tableam.h(1522) : warning C4715: 'table_insert_begin': not all control paths return a value [07:35:25.680] c:\cirrus\src\include\access\tableam.h(1561) : warning C4715: 'table_multi_insert_next_free_slot': not all control paths return a value [07:35:25.680] [633/2212] Compiling C object src/backend/postgres_lib.a.p/commands_createas.c.obj [07:35:25.680] c:\cirrus\src\include\access\tableam.h(1522) : warning C4715: 'table_insert_begin': not all control paths return a value [07:35:26.310] [646/2212] Compiling C object src/backend/postgres_lib.a.p/commands_matview.c.obj [07:35:26.310] c:\cirrus\src\include\access\tableam.h(1522) : warning C4715: 'table_insert_begin': not all control paths return a value -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
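For context, warnings like C4715 typically fire on inline wrappers whose no-callback path ends in ereport(ERROR), which MSVC does not treat as noreturn; the usual idiom is an unreachable return after the error. A minimal sketch of that shape follows; the callback member name ("insert_begin") and the argument list are assumptions for illustration, not the exact code in the patch.

/* Sketch only: illustrative wrapper shape and the idiomatic C4715 fix. */
static inline TableInsertState *
table_insert_begin(Relation rel, CommandId cid, int options)
{
    if (rel->rd_tableam->insert_begin != NULL)
        return rel->rd_tableam->insert_begin(rel, cid, options);

    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("table access method does not support inserts")));

    return NULL;                /* keep MSVC (C4715) quiet; not reached */
}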
On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > > Please find the attached v9 patch set. I've had to rebase the patches due to commit 874d817, please find the attached v11 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > Please find the attached v9 patch set. > > I've had to rebase the patches due to commit 874d817, please find the > attached v11 patch set. Rebase needed. Please see the v12 patch set. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Hi, On Fri, Mar 8, 2024 at 7:37 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > Please find the attached v9 patch set. > > > > I've had to rebase the patches due to commit 874d817, please find the > > attached v11 patch set. > > Rebase needed. Please see the v12 patch set. > I've not reviewed the patches in depth yet, but run performance tests for CREATE MATERIALIZED VIEW. The test scenarios is: -- setup create unlogged table test (c int); insert into test select generate_series(1, 10000000); -- run create materialized view test_mv as select * from test; Here are the results: * HEAD 3775.221 ms 3744.039 ms 3723.228 ms * v12 patch 6289.972 ms 5880.674 ms 7663.509 ms I can see performance regressions and the perf report says that CPU spent most time on extending the ResourceOwner's array while copying the buffer-heap tuple: - 52.26% 0.18% postgres postgres [.] intorel_receive 52.08% intorel_receive table_multi_insert_v2 (inlined) - heap_multi_insert_v2 - 51.53% ExecCopySlot (inlined) tts_buffer_heap_copyslot tts_buffer_heap_store_tuple (inlined) - IncrBufferRefCount - ResourceOwnerEnlarge ResourceOwnerAddToHash (inlined) Is there any reason why we copy a buffer-heap tuple to another buffer-heap tuple? Which results in that we increments the buffer refcount and register it to ResourceOwner for every tuples. I guess that the destination tuple slot is not necessarily a buffer-heap, and we could use VirtualTupleTableSlot instead. It would in turn require copying a heap tuple. I might be missing something but it improved the performance at least in my env. The change I made was: - dstslot = table_slot_create(state->rel, NULL); + //dstslot = table_slot_create(state->rel, NULL); + dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel), + &TTSOpsVirtual); + And the execution times are: 1588.984 ms 1591.618 ms 1582.519 ms Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 19, 2024 at 10:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've not reviewed the patches in depth yet, but run performance tests > for CREATE MATERIALIZED VIEW. The test scenarios is: Thanks for looking into this. > Is there any reason why we copy a buffer-heap tuple to another > buffer-heap tuple? Which results in that we increments the buffer > refcount and register it to ResourceOwner for every tuples. I guess > that the destination tuple slot is not necessarily a buffer-heap, and > we could use VirtualTupleTableSlot instead. It would in turn require > copying a heap tuple. I might be missing something but it improved the > performance at least in my env. The change I made was: > > - dstslot = table_slot_create(state->rel, NULL); > + //dstslot = table_slot_create(state->rel, NULL); > + dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel), > + &TTSOpsVirtual); > + > > And the execution times are: > 1588.984 ms > 1591.618 ms > 1582.519 ms Yes, usingVirtualTupleTableSlot helps improve the performance a lot. Below are results from my testing. Note that CMV, RMV, CTAS stand for CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, CREATE TABLE AS respectively. These commands got faster by 62.54%, 68.87%, 74.31% or 2.67, 3.21, 3.89 times respectively. I've used the test case specified at [1]. HEAD: CMV: Time: 6276.468 ms (00:06.276) CTAS: Time: 8141.632 ms (00:08.142) RMV: Time: 14747.139 ms (00:14.747) PATCHED: CMV: Time: 2350.282 ms (00:02.350) CTAS: Time: 2091.427 ms (00:02.091) RMV: Time: 4590.180 ms (00:04.590) I quickly looked at the description of what a "virtual" tuple is from src/include/executor/tuptable.h [2]. IIUC, it is invented for minimizing data copying, but it also says that it's the responsibility of the generating plan node to be sure these resources are not released for as long as the virtual tuple needs to be valid or is materialized. While it says this, as far as this patch is concerned, the virtual slot gets materialized when we copy the tuples from source slot (can be any type of slot) to destination slot (which is virtual slot). See ExecCopySlot-> tts_virtual_copyslot->tts_virtual_materialize. This way, tts_virtual_copyslot ensures the tuples storage doesn't depend on external memory because all the datums that aren't passed by value are copied into the slot's memory context. With the above understanding, it looks safe to use virtual slots for the multi insert buffered slots. I'm not so sure if I'm missing anything here. [1] cd $PWD/pg17/bin rm -rf data logfile ./initdb -D data ./pg_ctl -D data -l logfile start ./psql -d postgres \timing drop table test cascade; create unlogged table test (c int); insert into test select generate_series(1, 10000000); create materialized view test_mv as select * from test; create table test_copy as select * from test; insert into test select generate_series(1, 10000000); refresh materialized view test_mv; [2] * A "virtual" tuple is an optimization used to minimize physical data copying * in a nest of plan nodes. Until materialized pass-by-reference Datums in * the slot point to storage that is not directly associated with the * TupleTableSlot; generally they will point to part of a tuple stored in a * lower plan node's output TupleTableSlot, or to a function result * constructed in a plan node's per-tuple econtext. It is the responsibility * of the generating plan node to be sure these resources are not released for * as long as the virtual tuple needs to be valid or is materialized. 
Note * also that a virtual tuple does not have any "system columns". -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
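To make the slot discussion concrete, here is a minimal sketch of the buffering pattern being adopted. The helper name and arguments are illustrative; MakeTupleTableSlot(), TTSOpsVirtual and ExecCopySlot() are the existing executor APIs.

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

/* Sketch: buffer an incoming tuple in an AM-managed virtual slot. */
static void
buffer_one_tuple(Relation rel, TupleTableSlot **bufslots, int i,
                 TupleTableSlot *srcslot)
{
    if (bufslots[i] == NULL)
        bufslots[i] = MakeTupleTableSlot(RelationGetDescr(rel),
                                         &TTSOpsVirtual);

    /*
     * The virtual slot's copy path materializes the tuple into the slot's
     * own memory, so no buffer pin or ResourceOwner entry is taken per
     * buffered tuple, unlike copying into a buffer-heap slot.
     */
    ExecCopySlot(bufslots[i], srcslot);
}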
On Thu, Mar 21, 2024 at 9:44 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Yes, usingVirtualTupleTableSlot helps improve the performance a lot. > Below are results from my testing. Note that CMV, RMV, CTAS stand for > CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, CREATE TABLE AS > respectively. These commands got faster by 62.54%, 68.87%, 74.31% or > 2.67, 3.21, 3.89 times respectively. I've used the test case specified > at [1]. > > HEAD: > CMV: > Time: 6276.468 ms (00:06.276) > CTAS: > Time: 8141.632 ms (00:08.142) > RMV: > Time: 14747.139 ms (00:14.747) > > PATCHED: > CMV: > Time: 2350.282 ms (00:02.350) > CTAS: > Time: 2091.427 ms (00:02.091) > RMV: > Time: 4590.180 ms (00:04.590) > > I quickly looked at the description of what a "virtual" tuple is from > src/include/executor/tuptable.h [2]. IIUC, it is invented for > minimizing data copying, but it also says that it's the responsibility > of the generating plan node to be sure these resources are not > released for as long as the virtual tuple needs to be valid or is > materialized. While it says this, as far as this patch is concerned, > the virtual slot gets materialized when we copy the tuples from source > slot (can be any type of slot) to destination slot (which is virtual > slot). See ExecCopySlot-> > tts_virtual_copyslot->tts_virtual_materialize. This way, > tts_virtual_copyslot ensures the tuples storage doesn't depend on > external memory because all the datums that aren't passed by value are > copied into the slot's memory context. > > With the above understanding, it looks safe to use virtual slots for > the multi insert buffered slots. I'm not so sure if I'm missing > anything here. I'm attaching the v13 patches using virtual tuple slots for buffered tuples for multi inserts. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, 2024-03-21 at 13:10 +0530, Bharath Rupireddy wrote: > I'm attaching the v13 patches using virtual tuple slots for buffered > tuples for multi inserts. Comments: * Do I understand correctly that CMV, RMV, and CTAS experience a performance benefit, but COPY FROM does not? And is that because COPY already used table_multi_insert, whereas CMV and RMV did not? * In the COPY FROM code, it looks like it's deciding whether to flush based on MAX_BUFFERED_TUPLES, but the slot array is allocated with MAX_BUFFERED_SLOTS (they happen to be the same for heap, but perhaps not for other AMs). The copy code shouldn't be using internal knowledge of the multi-insert code; it should know somehow from the API when the right time is to flush. * How is the memory management expected to work? It looks like COPY FROM is using the ExprContext when running the input functions, but we really want to be using a memory context owned by the table AM, right? * What's the point of the table_multi_insert_slots() "num_slots" argument? The only caller simply discards it. * table_tuple_insert_v2 isn't called anywhere, what's it for? * the "v2" naming is inconsistent -- it seems you only added it in places where there's a name conflict, which makes it hard to tell which API methods go together. I'm not sure how widely table_multi_insert* is used outside of core, so it's possible that we may even be able to just change those APIs and the few extensions that call it can be updated. * Memory usage tracking should be done in the AM by allocating everything in a single context so it's easy to check the size. Don't manually add up memory. * I don't understand: "Caller may have got the slot using heap_multi_insert_next_free_slot, filled it and passed. So, skip copying in such a case." If the COPY FROM had a WHERE clause and skipped a tuple after filling the slot, doesn't that mean the slot has bogus data from the last tuple? * We'd like this to work for insert-into-select (IIS) and logical replication, too. Do you see any problem there, or is it just a matter of code? * Andres had some comments[1] that don't seem entirely addressed. - You are still allocating the AM-specific part of TableModifyState as a separately-allocated chunk. - It's still called TableInsertState rather than TableModifyState as he suggested. If you change that, you should also change to table_modify_begin/end. - CID: I suppose Andres is considering the use case of "BEGIN; ... ten thousand inserts ... COMMIT;". I don't think this problem is really solvable (discussed below) but we should have some response/consensus on that point. - He mentioned that we only need one new method implemented by the AM. I don't know if one is enough, but 7 does seem excessive. I have some simplification ideas below. Overall: If I understand correctly, there are two ways to use the API: 1. used by CTAS, MV: tistate = table_insert_begin(...); table_multi_insert_v2(tistate, tup1); ... table_multi_insert_v2(tistate, tupN); table_insert_end(tistate); 2. used by COPY ... FROM: tistate = table_insert_begin(..., SKIP_FLUSH); if (multi_insert_slot_array_is_full()) table_multi_insert_flush(tistate); slot = table_insert_next_free_slot(tistate); ... fill slot with tup1 table_multi_insert_v2(tistate, tup1); ... slot = table_insert_next_free_slot(tistate); ... fill slot with tupN table_multi_insert_v2(tistate, tupN); table_insert_end(tistate); Those two uses need comments explaining what's going on. It appears the SKIP_FLUSH flag is used to indicate which use the caller intends. 
Use #2 is not enforced well by either the API or runtime checks. If the caller neglects to check for a full buffer, it appears that it will just overrun the slots array. Also, for use #2, table_multi_insert_v2() doesn't do much other than incrementing the memory used. The slot will never be NULL because it was obtained with table_multi_insert_next_free_slot(), and the other two branches don't happen when SKIP_FLUSH is true. The real benefit to COPY of your new API is that the AM can manage slots for itself, and how many tuples may be tracked (which might be a lot higher for non-heap AMs). I agree with Luc Vlaming's comment[2] that more should be left to the table AM. Your patch tries too hard to work with the copyfrom.c slot array, somehow sharing it with the table AM. That adds complexity to the API and feels like a layering violation. We also shouldn't mandate a slot array in the API. Each slot is 64 bytes -- a lot of overhead for small tuples. For a non-heap AM, it's much better to store the tuple data in a big contiguous chunk with minimal overhead. Let's just have a simple API like: tmstate = table_modify_begin(...); table_modify_save_insert(tmstate, tup1); ... table_modify_save_insert(tmstate, tupN); table_modify_end(tmstate); and leave it up to the AM to do all the buffering and flushing work (as Luc Vlaming suggested[2]). That leaves one problem, which is: how do we update the indexes and call AR triggers while flushing? I think the best way is to just have a callback in the TableModifyState that is called during flush. (I don't think that would affect performance, but worth double-checking.) We have to disable this whole multi-insert mechanism if there are volatile BR/AR triggers, because those are supposed to see already- inserted tuples. That's not a problem with your patch but it is a bit unfortunate -- triggers can be costly already, but this increases the penalty. There may be some theoretical ways to avoid this problem, like reading tuples out of the unflushed buffer during a SELECT, which sounds a little too clever (though perhaps not completely crazy if the AM is in control of both?). For potentially working with multi-updates/deletes, it might be as simple as tracking the old TIDs along with the slots and having new _save_update and _save_delete methods. I haven't thought deeply about that, and I'm not sure we have a good example AM to work with, but it seems plausible that we could make something useful here. To batch multiple different INSERT statements within a transaction just seems like a really hard problem. That could mean different CIDs, but also different subtransaction IDs. Constraint violation errors will happen at the time of flushing, which could be many commands later from the one that actually violates the constraint. And what if someone issues a SELECT in the middle of the transaction, how does it see the already-inserted-but-not-flushed tuples? If that's not hard enough already, then you would also need to extend low-level APIs to accept arbitrary CIDs and subxact IDs when storing tuples during a flush. The only way I could imagine solving all of these problems is declaring somehow that your transaction won't do any of these complicated things, and that you don't mind getting constraint violations at the wrong time. So I recommend that you punt on this problem. 
Regards, Jeff Davis [1] https://www.postgresql.org/message-id/20230603223824.o7iyochli2dwwi7k%40alap3.anarazel.de [2] https://www.postgresql.org/message-id/508af801-6356-d36b-1867-011ac6df8f55%40swarm64.com
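To make the proposed simplification concrete, a minimal sketch of the caller-side pattern using the names suggested above (table_modify_begin, table_modify_save_insert, table_modify_end); the argument lists and the tuple source are placeholders, not an existing API.

/* Hypothetical tuple source standing in for the caller's query/receiver. */
extern TupleTableSlot *next_tuple_from_query(void);

/* Sketch of the proposed caller-side pattern; the AM owns all buffering. */
static void
insert_all_tuples(Relation rel, CommandId cid, int options)
{
    TableModifyState *tmstate = table_modify_begin(rel, cid, options);
    TupleTableSlot *slot;

    while ((slot = next_tuple_from_query()) != NULL)
        table_modify_save_insert(tmstate, slot);    /* AM buffers and flushes */

    table_modify_end(tmstate);                      /* flushes any remainder */
}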
On Sat, Mar 23, 2024 at 5:47 AM Jeff Davis <pgsql@j-davis.com> wrote: > > Comments: Thanks for looking into it. > * Do I understand correctly that CMV, RMV, and CTAS experience a > performance benefit, but COPY FROM does not? And is that because COPY > already used table_multi_insert, whereas CMV and RMV did not? Yes, that's right. COPY FROM is already optimized with multi inserts. I now have a feeling that I need to simplify the patches. I'm thinking of dropping the COPY FROM patch using the new multi insert API for the following reasons: 1. We can now remove some of the new APIs (table_multi_insert_slots and table_multi_insert_next_free_slot) that were just invented for COPY FROM. 2. COPY FROM is already optimized with multi inserts, so no real gain is expected with the new multi insert API. 3. As we are inching towards feature freeze, simplifying the patches by having only the necessary things increases the probability of getting this in. 4. The real benefit of this whole new multi insert API is seen if used for the commands CMV, RMV, CTAS. These commands got faster by 62.54%, 68.87%, 74.31% or 2.67, 3.21, 3.89 times respectively. 5. This leaves with really simple APIs. No need for callback stuff for dealing with indexes, triggers etc. as CMV, RMV, CTAS cannot have any of them. The new APIs are more extensible, memory management is taken care of by AM, and with TableModifyState as the structure name and more meaningful API names. The callback for triggers/indexes etc. aren't taken care of as I'm now only focusing on CTAS, CMV, RMV optimizations. Please see the attached v14 patches. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote: > I'm thinking > of dropping the COPY FROM patch using the new multi insert API for > the > following reasons: ... I agree with all of this. We do want COPY ... FROM support, but there are some details to work out and we don't want to make a big code change at this point in the cycle. > The new APIs are more extensible, memory management is taken care of > by AM, and with TableModifyState as the structure name and more > meaningful API names. The callback for triggers/indexes etc. aren't > taken care of as I'm now only focusing on CTAS, CMV, RMV > optimizations. > > Please see the attached v14 patches. * No need for a 'kind' field in TableModifyState. The state should be aware of the kinds of changes that it has received and that may need to be flushed later -- for now, only inserts, but possibly updates/deletes in the future. * If the AM doesn't support the bulk methods, fall back to retail inserts instead of throwing an error. * It seems like this API will eventually replace table_multi_insert and table_finish_bulk_insert completely. Do those APIs have any advantage remaining over the new one proposed here? * Right now I don't see any important use of the flush method. It seems that could be accomplished in the finish method, and flush could just be an internal detail when the memory is exhausted. If we find a use for it later, we can always add it, but right now it seems unnecessary. * We need to be careful about cases where the command can be successful but the writes are not flushed. I don't think that's a problem with the current patch, but we will need to do something here when we expand to INSERT INTO ... SELECT. Andres, is this patch overall closer to what you had in mind in the email here: https://www.postgresql.org/message-id/20230603223824.o7iyochli2dwwi7k@alap3.anarazel.de ? Regards, Jeff Davis
On Tue, Mar 26, 2024 at 9:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2024-03-26 at 01:28 +0530, Bharath Rupireddy wrote: > > I'm thinking > > of dropping the COPY FROM patch using the new multi insert API for > > the > > following reasons: ... > > I agree with all of this. We do want COPY ... FROM support, but there > are some details to work out and we don't want to make a big code > change at this point in the cycle. Right. > > Please see the attached v14 patches. > > * No need for a 'kind' field in TableModifyState. The state should be > aware of the kinds of changes that it has received and that may need to > be flushed later -- for now, only inserts, but possibly updates/deletes > in the future. Removed 'kind' field with lazy initialization of required AM specific modify (insert in this case) state. Since we don't have 'kind', I chose the callback approach to cleanup the modify (insert in this case) specific state at the end. > * If the AM doesn't support the bulk methods, fall back to retail > inserts instead of throwing an error. For instance, CREATE MATERIALIZED VIEW foo_mv AS SELECT * FROM foo USING bar_tam; doesn't work if bar_tam doesn't have the table_tuple_insert implemented. Similarly, with this new AM, the onus lies on the table AM implementers to provide an implementation for these new AMs even if they just do single inserts. But, I do agree that we must catch this ahead during parse analysis itself, so I've added assertions in GetTableAmRoutine(). > * It seems like this API will eventually replace table_multi_insert and > table_finish_bulk_insert completely. Do those APIs have any advantage > remaining over the new one proposed here? table_multi_insert needs to be there for sure as COPY ... FROM uses it. Not sure if we need to remove the optional callback table_finish_bulk_insert though. Heap AM doesn't implement one, but some other AM might. Having said that, with this new AM, whatever the logic that used to be there in table_finish_bulk_insert previously, table AM implementers will have to move them to table_modify_end. FWIW, I can try writing a test table AM that uses this new AM but just does single inserts, IOW, equivalent to table_tuple_insert(). Thoughts? > * Right now I don't any important use of the flush method. It seems > that could be accomplished in the finish method, and flush could just > be an internal detail when the memory is exhausted. If we find a use > for it later, we can always add it, but right now it seems unnecessary. Firstly, we are not storing CommandId and options in TableModifyState, because we expect CommandId to be changing (per Andres comment). Secondly, we don't want to pass just the CommandId and options to table_modify_end(). Thirdly, one just has to call the table_modify_buffer_flush before the table_modify_end. Do you have any other thoughts here? > * We need to be careful about cases where the command can be successful > but the writes are not flushed. I don't tihnk that's a problem with the > current patch, but we will need to do something here when we expand to > INSERT INTO ... SELECT. You mean, writes are not flushed to the disk? Can you please elaborate why it's different for INSERT INTO ... SELECT and not others? Can't the new flush AM be helpful here to implement any flush related things? Please find the attached v15 patches with the above review comments addressed. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, 2024-03-27 at 01:19 +0530, Bharath Rupireddy wrote: > > Similarly, with this new AM, the onus lies on the table AM > implementers to provide an implementation for these new AMs even if > they just do single inserts. Why not fall back to using the plain tuple_insert? Surely some table AMs might be simple and limited, and we shouldn't break them just because they don't implement the new APIs. > > table_multi_insert needs to be there for sure as COPY ... FROM uses > it. After we have these new APIs fully in place and used by COPY, what will happen to those other APIs? Will they be deprecated or will there be a reason to keep them? > FWIW, I can try writing a test table AM that uses this new AM but > just > does single inserts, IOW, equivalent to table_tuple_insert(). > Thoughts? More table AMs to test against would be great, but I also know that can be a lot of work. > > Firstly, we are not storing CommandId and options in > TableModifyState, > because we expect CommandId to be changing (per Andres comment). Trying to make this feature work across multiple commands poses a lot of challenges: what happens when there are SELECTs and subtransactions and non-deferrable constraints? Regardless, if we care about multiple CIDs, they should be stored along with the tuples, not supplied at the time of flushing. > You mean, writes are not flushed to the disk? Can you please > elaborate > why it's different for INSERT INTO ... SELECT and not others? Can't > the new flush AM be helpful here to implement any flush related > things? Not a major problem. We can discuss while working on IIS support. I am concerned that the flush callback is not a part of the API. We will clearly need that to support index insertions for COPY/IIS, so as-is the API feels incomplete. Thoughts? Regards, Jeff Davis
On Wed, Mar 27, 2024 at 1:42 PM Jeff Davis <pgsql@j-davis.com> wrote: > > On Wed, 2024-03-27 at 01:19 +0530, Bharath Rupireddy wrote: > > > > Similarly, with this new AM, the onus lies on the table AM > > implementers to provide an implementation for these new AMs even if > > they just do single inserts. > > Why not fall back to using the plain tuple_insert? Surely some table > AMs might be simple and limited, and we shouldn't break them just > because they don't implement the new APIs. Hm. That might complicate table_modify_begin, table_modify_buffer_insert and table_modify_end a bit. What do we put in TableModifyState then? Do we create the bulk insert state (BulkInsertStateData) outside? I think to give a better interface, can we let TAM implementers support these new APIs in their own way? If this sounds rather intrusive, we can just implement the fallback to tuple_insert if these new API are not supported in the caller, for example, do something like below in createas.c and matview.c. Thoughts? if (table_modify_buffer_insert() is defined) table_modify_buffer_insert(...); else { myState->bistate = GetBulkInsertState(); table_tuple_insert(...); } > > table_multi_insert needs to be there for sure as COPY ... FROM uses > > it. > > After we have these new APIs fully in place and used by COPY, what will > happen to those other APIs? Will they be deprecated or will there be a > reason to keep them? Deprecated perhaps? Please find the attached v16 patches for further review. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote: > if (table_modify_buffer_insert() is defined) > table_modify_buffer_insert(...); > else > { > myState->bistate = GetBulkInsertState(); > table_tuple_insert(...); > } We can't alloc/free the bulk insert state for every insert call. I see two options: * Each caller needs to support two code paths: if the buffered insert APIs are defined, then use those; otherwise the caller needs to manage the bulk insert state itself and call the plain insert API. * Have default implementation for the new API methods, so that the default for the begin method would allocate the bulk insert state, and the default for the buffered insert method would be to call plain insert using the bulk insert state. I'd prefer the latter, at least in the long term. But I haven't really thought through the details, so perhaps we'd need to use the former. > > > > After we have these new APIs fully in place and used by COPY, what > > will > > happen to those other APIs? Will they be deprecated or will there > > be a > > reason to keep them? > > Deprecated perhaps? Including Alexander on this thread, because he's making changes to the multi-insert API. We need some consensus on where we are going with these APIs before we make more changes, and what incremental steps make sense in v17. Here's where I think this API should go: 1. Have table_modify_begin/end and table_modify_buffer_insert, like those that are implemented in your patch. 2. Add some kind of flush callback that will be called either while the tuples are being flushed or after the tuples are flushed (but before they are freed by the AM). (Aside: do we need to call it while the tuples are being flushed to get the right visibility semantics for after-row triggers?) 3. Add table_modify_buffer_{update|delete} APIs. 4. Some kind of API tweaks to help manage memory when modifying pertitioned tables, so that the buffering doesn't get out of control. Perhaps just reporting memory usage and allowing the caller to force flushes would be enough. 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is fairly straightforward, I believe, and handled by your patch. Indexes are (re)built afterward, and no triggers are possible. 6. Use these new methods for CREATE TABLE ... AS. This is fairly straightforward, I believe, and handled by your patch. No indexes or triggers are possible. 7. Use these new methods for COPY. We have to be careful to avoid regressions for the heap method, because it's already managing its own buffers. If the AM manages the buffering, then it may require additional copying of slots, which could be a disadvantage. To solve this, we may need some minor API tweaks to avoid copying when the caller guarantees that the memory will not be freed to early, or perhaps expose the AM's memory context to copyfrom.c. Another thing to consider is that the buffering in copyfrom.c is also used for FDWs, so that buffering code path needs to be preserved in copyfrom.c even if not used for AMs. 8. Use these new methods for INSERT INTO ... SELECT. One potential challenge here is that execution nodes are not always run to completion, so we need to be sure that the flush isn't forgotten in that case. 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use the buffer_insert/update/delete APIs; we don't need a separate merge method. 
This probably requires that the AM maintain 3 separate buffers to distinguish different kinds of changes at flush time (obviously these can be initialized lazily to avoid overhead when not being used). 10. Use these new methods for logical apply. 11. Deprecate the multi_insert API. Thoughts on this plan? Does your patch make sense in v17 as a stepping stone, or should we try to make all of these API changes together in v18? Also, a sample AM code would be a huge benefit here. Writing a real AM is hard, but perhaps we can at least have an example one to demonstrate how to use these APIs? Regards, Jeff Davis
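A minimal sketch of the "default implementation" option discussed above, assuming a hypothetical TableModifyState layout; only GetBulkInsertState(), table_tuple_insert() and FreeBulkInsertState() are existing APIs, and the actual patch may structure this differently.

#include "postgres.h"
#include "access/heapam.h"
#include "access/tableam.h"

/* Hypothetical state layout for the fallback path; not the patch's struct. */
typedef struct TableModifyState
{
    Relation        rel;
    CommandId       cid;
    int             options;
    BulkInsertState bistate;        /* allocated once, reused for every row */
} TableModifyState;

static TableModifyState *
default_modify_begin(Relation rel, CommandId cid, int options)
{
    TableModifyState *state = palloc0(sizeof(TableModifyState));

    state->rel = rel;
    state->cid = cid;
    state->options = options;
    state->bistate = GetBulkInsertState();
    return state;
}

static void
default_modify_buffer_insert(TableModifyState *state, TupleTableSlot *slot)
{
    /* No AM-level buffering: insert the tuple right away. */
    table_tuple_insert(state->rel, slot, state->cid, state->options,
                       state->bistate);
}

static void
default_modify_end(TableModifyState *state)
{
    FreeBulkInsertState(state->bistate);
    pfree(state);
}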
On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote: > > if (table_modify_buffer_insert() is defined) > > table_modify_buffer_insert(...); > > else > > { > > myState->bistate = GetBulkInsertState(); > > table_tuple_insert(...); > > } > > We can't alloc/free the bulk insert state for every insert call. I see > two options: > > * Each caller needs to support two code paths: if the buffered insert > APIs are defined, then use those; otherwise the caller needs to manage > the bulk insert state itself and call the plain insert API. > > * Have default implementation for the new API methods, so that the > default for the begin method would allocate the bulk insert state, and > the default for the buffered insert method would be to call plain > insert using the bulk insert state. > > I'd prefer the latter, at least in the long term. But I haven't really > thought through the details, so perhaps we'd need to use the former. I too prefer the latter so that the caller doesn't have to have two paths. The new API can just transparently fallback to single inserts. I've implemented that in the attached v17 patch. I also tested the default APIs manually, but I'll see if I can add some tests to it the default API. > > > After we have these new APIs fully in place and used by COPY, what > > > will > > > happen to those other APIs? Will they be deprecated or will there > > > be a > > > reason to keep them? > > > > Deprecated perhaps? > > Including Alexander on this thread, because he's making changes to the > multi-insert API. We need some consensus on where we are going with > these APIs before we make more changes, and what incremental steps make > sense in v17. > > Here's where I think this API should go: > > 1. Have table_modify_begin/end and table_modify_buffer_insert, like > those that are implemented in your patch. > > 2. Add some kind of flush callback that will be called either while the > tuples are being flushed or after the tuples are flushed (but before > they are freed by the AM). (Aside: do we need to call it while the > tuples are being flushed to get the right visibility semantics for > after-row triggers?) > > 3. Add table_modify_buffer_{update|delete} APIs. > > 4. Some kind of API tweaks to help manage memory when modifying > pertitioned tables, so that the buffering doesn't get out of control. > Perhaps just reporting memory usage and allowing the caller to force > flushes would be enough. > > 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is > fairly straightforward, I believe, and handled by your patch. Indexes > are (re)built afterward, and no triggers are possible. > > 6. Use these new methods for CREATE TABLE ... AS. This is fairly > straightforward, I believe, and handled by your patch. No indexes or > triggers are possible. > > 7. Use these new methods for COPY. We have to be careful to avoid > regressions for the heap method, because it's already managing its own > buffers. If the AM manages the buffering, then it may require > additional copying of slots, which could be a disadvantage. To solve > this, we may need some minor API tweaks to avoid copying when the > caller guarantees that the memory will not be freed to early, or > perhaps expose the AM's memory context to copyfrom.c. Another thing to > consider is that the buffering in copyfrom.c is also used for FDWs, so > that buffering code path needs to be preserved in copyfrom.c even if > not used for AMs. > > 8. 
Use these new methods for INSERT INTO ... SELECT. One potential > challenge here is that execution nodes are not always run to > completion, so we need to be sure that the flush isn't forgotten in > that case. > > 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use > the buffer_insert/update/delete APIs; we don't need a separate merge > method. This probably requires that the AM maintain 3 separate buffers > to distinguish different kinds of changes at flush time (obviously > these can be initialized lazily to avoid overhead when not being used). > > 10. Use these new methods for logical apply. > > 11. Deprecate the multi_insert API. > > Thoughts on this plan? Does your patch make sense in v17 as a stepping > stone, or should we try to make all of these API changes together in > v18? I'd like to see the new multi insert API (as proposed in the v17 patches) for PG17 if possible. The basic idea with these new APIs is to let the AM implementers choose the right buffered insert strategy (one can choose the AM specific slot type to buffer the tuples, choose the AM specific memory and flushing decisions etc.). Another advantage with these new multi insert API is that the CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW, CREATE TABLE AS commands for heap AM got faster by 62.54%, 68.87%, 74.31% or 2.67, 3.21, 3.89 times respectively. The performance improvement in REFRESH MATERIALIZED VIEW can benefit customers running analytical workloads on postgres. I'm fine if we gradually add more infrastructure to support COPY, INSERT INTO SELECT, Logical Replication Apply, Table Rewrites in future releases. I'm sure it requires a lot more thoughts and time. > Also, a sample AM code would be a huge benefit here. Writing a real AM > is hard, but perhaps we can at least have an example one to demonstrate > how to use these APIs? The heap AM implements this new API. Also, there's a default implementation for the new API falling back on to single inserts. Aren't these sufficient to help AM implementers to come up with their own implementations? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Apr 3, 2024 at 2:32 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > I too prefer the latter so that the caller doesn't have to have two > paths. The new API can just transparently fallback to single inserts. > I've implemented that in the attached v17 patch. I also tested the > default APIs manually, but I'll see if I can add some tests to it the > default API. Fixed a compiler warning found via CF bot. Please find the attached v18 patches. I'm sorry for the noise. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From: Bharath Rupireddy
On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.
I added table_modify_begin, table_modify_buffer_insert, table_modify_buffer_flush and table_modify_end. Table Access Method (AM) authors can now define their own buffering strategy and make their own flushing decisions based on their tuple storage kind and various other AM-specific factors. I also added a default implementation that falls back to single inserts when an AM author doesn't provide implementations for these APIs. See the attached v19-0001 patch.
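As a sketch of what an AM author would wire up for these methods; the TableAmRoutine member names below are assumptions mirroring the API names, not necessarily the exact members added by v19-0001.

#include "access/tableam.h"

/* Hypothetical AM-side implementations (bodies elided for brevity). */
static TableModifyState *myam_modify_begin(Relation rel, CommandId cid,
                                           int options);
static void myam_modify_buffer_insert(TableModifyState *state,
                                      TupleTableSlot *slot);
static void myam_modify_buffer_flush(TableModifyState *state);
static void myam_modify_end(TableModifyState *state);

static const TableAmRoutine myam_methods = {
    .type = T_TableAmRoutine,
    /* ... the usual scan/insert/vacuum callbacks go here ... */

    /* hypothetical member names mirroring the new table_modify_* wrappers */
    .tuple_modify_begin = myam_modify_begin,
    .tuple_modify_buffer_insert = myam_modify_buffer_insert,
    .tuple_modify_buffer_flush = myam_modify_buffer_flush,
    .tuple_modify_end = myam_modify_end,
};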
> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)
I added a flush callback named TableModifyBufferFlushCallback; when provided by the caller, it is invoked after tuples are flushed from the buffers but before the AM frees them. Index insertions and AFTER ROW INSERT triggers can be executed in this callback. See the v19-0001 patch for how the AM invokes the flush callback, and see v19-0003, v19-0004 or v19-0005 for how a caller can supply the callback and the context required to execute index insertions and AR triggers.
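A sketch of how a caller might supply such a callback to do index and trigger work at flush time; the typedef and context struct here are assumptions based on the description above, while ExecInsertIndexTuples() is the existing executor API.

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/* Assumed callback shape, sketched from the description above. */
typedef void (*TableModifyBufferFlushCallback) (void *context,
                                                TupleTableSlot **slots,
                                                int nslots);

/* Hypothetical caller-side context for index/trigger work. */
typedef struct InsertFlushContext
{
    EState         *estate;
    ResultRelInfo  *resultRelInfo;
} InsertFlushContext;

static void
insert_flush_callback(void *context, TupleTableSlot **slots, int nslots)
{
    InsertFlushContext *ctx = (InsertFlushContext *) context;

    for (int i = 0; i < nslots; i++)
    {
        List *recheck = NIL;

        if (ctx->resultRelInfo->ri_NumIndices > 0)
            recheck = ExecInsertIndexTuples(ctx->resultRelInfo, slots[i],
                                            ctx->estate, false, false,
                                            NULL, NIL, false);

        /* ExecARInsertTriggers() could be fired here as well. */
        list_free(recheck);
    }
}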
> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).
I haven't thought about these things yet. I can only focus on them after seeing how the attached patches go from here.
> 4. Some kind of API tweaks to help manage memory when modifying
> pertitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.
The heap implementation of these new Table AMs uses a separate memory context for all of its operations. Please have a look and let me know if we need anything more.
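A minimal sketch of that arrangement; the struct and function names are illustrative, not the patch's exact heap-side code.

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/memutils.h"

/* Hypothetical heap-side state; all buffered data lives in one context. */
typedef struct HeapModifyState
{
    MemoryContext   mctx;       /* owns the buffered slots and their data */
    TupleTableSlot **slots;
    int             maxslots;
    int             nused;
} HeapModifyState;

static HeapModifyState *
heap_modify_state_create(void)
{
    MemoryContext mctx = AllocSetContextCreate(CurrentMemoryContext,
                                               "heap modify state",
                                               ALLOCSET_DEFAULT_SIZES);
    HeapModifyState *hstate;

    hstate = MemoryContextAllocZero(mctx, sizeof(HeapModifyState));
    hstate->mctx = mctx;
    return hstate;
}

static void
heap_modify_state_destroy(HeapModifyState *hstate)
{
    /* One call frees every buffered tuple along with the state itself. */
    MemoryContextDelete(hstate->mctx);
}

With everything in one context, memory accounting for flush decisions could use MemoryContextMemAllocated() on that context rather than adding up sizes manually.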
> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.
I used multi inserts for all of these including TABLE REWRITE commands such as ALTER TABLE. See the attached v19-0002 patch. Check the testing section below for benefits.
FWIW, the following are some of the TABLE REWRITE commands that can benefit:
ALTER TABLE tbl ALTER c1 TYPE bigint;
ALTER TABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ALTER TABLE itest3 ALTER COLUMN a TYPE int;
ALTER TABLE gtest20 ALTER COLUMN b SET EXPRESSION AS (a * 3);
ALTER TABLE has_volatile ADD col4 int DEFAULT (random() * 10000)::int;
and so on.
> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.
I modified the COPY FROM code to use the new Table AMs, and performed some tests which show no signs of regression. Check the testing section below for more details. See the attached v19-0005 patch. With this, table_multi_insert can be deprecated.
> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.
I did that in v19-0003. I placed the table_modify_end call in multiple places, including ExecEndModifyTable, and didn't find any issues with it. Please have a look and let me know if we need the end call in more places. Check the testing section below for benefits.
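Roughly the kind of placement being described, sketched with a hypothetical mt_modify_state field on ModifyTableState; the real patch may wire this differently.

#include "postgres.h"
#include "nodes/execnodes.h"

/*
 * Sketch: called from ExecEndModifyTable() so buffered rows are flushed even
 * when the plan isn't run to completion.  "mt_modify_state" is a hypothetical
 * field the patch would add to ModifyTableState.
 */
static void
finish_buffered_modify(ModifyTableState *node)
{
    if (node->mt_modify_state != NULL)
    {
        table_modify_end(node->mt_modify_state);    /* flushes any remainder */
        node->mt_modify_state = NULL;
    }
}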
> 10. Use these new methods for logical apply.
I used multi inserts for Logical Replication apply in v19-0004. Check the testing section below for benefits.
FWIW, open-source pglogical does have multi insert support, check code around https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L960.
> 11. Deprecate the multi_insert API.
I removed both table_multi_insert and table_finish_bulk_insert in v19-0006. Perhaps removing them outright isn't a great idea; adding a deprecation WARNING/ERROR for a few more PG releases might be worth looking at instead.
> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?
If the design, code and benefits that these new Table AMs bring to the table look good, I hope to see it for PG 18.
> Also, a sample AM code would be a huge benefit here. Writing a real AM
> is hard, but perhaps we can at least have an example one to demonstrate
> how to use these APIs?
The attached patches already implement these new Table AMs for heap. I don't think we need a separate implementation to demonstrate them. If others feel otherwise, I'm open to thoughts here.
Having said the above, I'd like to reiterate the motivation behind the new Table AMs for multi and single inserts.
1. A scan-like API that carries state across calls is thought to be better, as suggested by Andres Freund - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de.
2. Allowing a Table AM to optimize operations across multiple inserts, define its own buffering strategy and make its own flushing decisions based on its tuple storage kind and various other AM-specific factors.
3. Improve performance of various SQL commands with multi inserts for Heap AM.
The attached v19 patches might need some more detailed comments, some documentation, and some specific tests ensuring that multi inserts for heap kick in for various commands. I'm open to thoughts here.
I did some testing to see how various commands benefit from multi inserts using these new Table AMs for heap. It's not only the performance of these commands that improves; the amount of WAL generated also drops greatly, because multi inserts write roughly one WAL record per data page (when multiple rows fit on a page) instead of one WAL record per row.
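A rough back-of-the-envelope model of why (ignoring full-page images, with N rows inserted and R rows fitting on a data page):

    WAL(single inserts) ~= N * (per-record overhead + per-row payload)
    WAL(multi inserts)  ~= (N / R) * per-record overhead + N * per-row payload

heap_multi_insert() emits one XLOG_HEAP2_MULTI_INSERT record per filled page, so the fixed per-record cost (XLogRecord header plus block reference) is amortized over R rows, whereas single inserts pay it once per row via XLOG_HEAP_INSERT.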
Test case 1: 100 million rows, 2 columns (int and float)
Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 121 | 77 | 36.3 | 1.57
CREATE MATERIALIZED VIEW | 101 | 49 | 51.4 | 2.06
REFRESH MATERIALIZED VIEW | 113 | 54 | 52.2 | 2.09
ALTER TABLE (TABLE REWRITE) | 124 | 81 | 34.6 | 1.53
COPY FROM | 71 | 72 | 0 | 1
INSERT INTO ... SELECT | 117 | 62 | 47 | 1.88
LOGICAL REPLICATION APPLY | 393 | 306 | 22.1 | 1.28
Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 6.8 | 2.4 | 64.7 | 2.83
CREATE MATERIALIZED VIEW | 7.2 | 2.3 | 68 | 3.13
REFRESH MATERIALIZED VIEW | 10 | 5.1 | 49 | 1.96
ALTER TABLE (TABLE REWRITE) | 8 | 3.2 | 60 | 2.5
COPY FROM | 2.9 | 3 | 0 | 1
INSERT INTO ... SELECT | 8 | 3 | 62.5 | 2.66
LOGICAL REPLICATION APPLY | 7.5 | 2.3 | 69.3 | 3.26
Test case 2: 1 billion rows, 1 column (int)
Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 794 | 386 | 51.38 | 2.05
CREATE MATERIALIZED VIEW | 1006 | 563 | 44.03 | 1.78
REFRESH MATERIALIZED VIEW | 977 | 603 | 38.28 | 1.62
ALTER TABLE (TABLE REWRITE) | 1189 | 714 | 39.94 | 1.66
COPY FROM | 321 | 330 | -0.02 | 0.97
INSERT INTO ... SELECT | 1084 | 586 | 45.94 | 1.84
LOGICAL REPLICATION APPLY | 3530 | 2982 | 15.52 | 1.18
Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 60 | 12 | 80 | 5
CREATE MATERIALIZED VIEW | 60 | 12 | 80 | 5
REFRESH MATERIALIZED VIEW | 60 | 12 | 80 | 5
ALTER TABLE (TABLE REWRITE) | 123 | 31 | 60 | 2.5
COPY FROM | 12 | 12 | 0 | 1
INSERT INTO ... SELECT | 120 | 24 | 80 | 5
LOGICAL REPLICATION APPLY | 61 | 12 | 80.32 | 5
Test setup:
./configure --prefix=$PWD/pg17/ --enable-tap-tests CFLAGS="-ggdb3 -O2" > install.log && make -j 8 install > install.log 2>&1 &
wal_level=logical
max_wal_size = 256GB
checkpoint_timeout = 1h
Test system is EC2 instance of type c5.4xlarge:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 7
BogoMIPS: 5999.99
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
RAM:
MemTotal: 32036536 kB
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
- v19-0001-Introduce-new-Table-Access-Methods-for-single-an.patch
- v19-0002-Optimize-CTAS-CMV-RMV-and-TABLE-REWRITES-with-mu.patch
- v19-0003-Optimize-INSERT-INTO-.-SELECT-with-multi-inserts.patch
- v19-0004-Optimize-Logical-Replication-apply-with-multi-in.patch
- v19-0005-Use-new-multi-insert-Table-AM-for-COPY-FROM.patch
- v19-0006-Remove-table_multi_insert-and-table_finish_bulk_.patch
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From: Pavel Stehule
On Wed, Apr 24, 2024 at 2:50 PM, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Apr 3, 2024 at 1:10 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> Here's where I think this API should go:
>
> 1. Have table_modify_begin/end and table_modify_buffer_insert, like
> those that are implemented in your patch.
I added table_modify_begin, table_modify_buffer_insert, table_modify_buffer_flush and table_modify_end. Table Access Method (AM) authors now can define their own buffering strategy and flushing decisions based on their tuple storage kinds and various other AM specific factors. I also added a default implementation that falls back to single inserts when no implementation is provided for these AM by AM authors. See the attached v19-0001 patch.
> 2. Add some kind of flush callback that will be called either while the
> tuples are being flushed or after the tuples are flushed (but before
> they are freed by the AM). (Aside: do we need to call it while the
> tuples are being flushed to get the right visibility semantics for
> after-row triggers?)
I added a flush callback named TableModifyBufferFlushCallback; when provided by callers invoked after tuples are flushed to disk from the buffers but before the AM frees them up. Index insertions and AFTER ROW INSERT triggers can be executed in this callback. See the v19-0001 patch for how AM invokes the flush callback, and see either v19-0003 or v19-0004 or v19-0005 for how a caller can supply the callback and required context to execute index insertions and AR triggers.
> 3. Add table_modify_buffer_{update|delete} APIs.
>
> 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
> the buffer_insert/update/delete APIs; we don't need a separate merge
> method. This probably requires that the AM maintain 3 separate buffers
> to distinguish different kinds of changes at flush time (obviously
> these can be initialized lazily to avoid overhead when not being used).
I haven't thought about these things yet. I can only focus on them after seeing how the attached patches go from here.
> 4. Some kind of API tweaks to help manage memory when modifying
> pertitioned tables, so that the buffering doesn't get out of control.
> Perhaps just reporting memory usage and allowing the caller to force
> flushes would be enough.
Heap implementation for thes new Table AMs uses a separate memory context for all of the operations. Please have a look and let me know if we need anything more.
> 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
> fairly straightforward, I believe, and handled by your patch. Indexes
> are (re)built afterward, and no triggers are possible.
>
> 6. Use these new methods for CREATE TABLE ... AS. This is fairly
> straightforward, I believe, and handled by your patch. No indexes or
> triggers are possible.
I used multi inserts for all of these including TABLE REWRITE commands such as ALTER TABLE. See the attached v19-0002 patch. Check the testing section below for benefits.
FWIW, following are some of the TABLE REWRITE commands that can get benefitted:
ALTER TABLE tbl ALTER c1 TYPE bigint;
ALTER TABLE itest13 ADD COLUMN c int GENERATED BY DEFAULT AS IDENTITY;
ALTER MATERIALIZED VIEW heapmv SET ACCESS METHOD heap2;
ALTER TABLE itest3 ALTER COLUMN a TYPE int;
ALTER TABLE gtest20 ALTER COLUMN b SET EXPRESSION AS (a * 3);
ALTER TABLE has_volatile ADD col4 int DEFAULT (random() * 10000)::int;
and so on.
> 7. Use these new methods for COPY. We have to be careful to avoid
> regressions for the heap method, because it's already managing its own
> buffers. If the AM manages the buffering, then it may require
> additional copying of slots, which could be a disadvantage. To solve
> this, we may need some minor API tweaks to avoid copying when the
> caller guarantees that the memory will not be freed to early, or
> perhaps expose the AM's memory context to copyfrom.c. Another thing to
> consider is that the buffering in copyfrom.c is also used for FDWs, so
> that buffering code path needs to be preserved in copyfrom.c even if
> not used for AMs.
I modified the COPY FROM code to use the new Table AMs, and performed some tests which show no signs of regression. Check the testing section below for more details. See the attached v19-0005 patch. With this, table_multi_insert can be deprecated.
> 8. Use these new methods for INSERT INTO ... SELECT. One potential
> challenge here is that execution nodes are not always run to
> completion, so we need to be sure that the flush isn't forgotten in
> that case.
I did that in v19-0003. I placed the table_modify_end call in multiple places, including ExecEndModifyTable, and didn't find any issues with it. Please have a look and let me know if we need the end call in more places. Check the testing section below for benefits.
> 10. Use these new methods for logical apply.
I used multi inserts for Logical Replication apply in v19-0004. Check the testing section below for benefits. FWIW, open-source pglogical does have multi insert support, see the code around https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L960.
> 11. Deprecate the multi_insert API.
I removed both table_multi_insert and table_finish_bulk_insert in v19-0006. Perhaps removing them isn't a great idea, but adding a deprecation WARNING/ERROR for a few more PG releases might be worth looking at.
> Thoughts on this plan? Does your patch make sense in v17 as a stepping
> stone, or should we try to make all of these API changes together in
> v18?
If the design, code and benefits that these new Table AMs bring to the table look good, I hope to see this in PG 18.
> Also, a sample AM code would be a huge benefit here. Writing a real AM
> is hard, but perhaps we can at least have an example one to demonstrate
> how to use these APIs?
The attached patches already implement these new Table AMs for heap. I don't think we need a separate implementation to demonstrate them, but if others feel otherwise, I'm open to suggestions.
Having said the above, I'd like to reiterate the motivation behind the new Table AMs for multi and single inserts.
1. A scan-like API with state being carried across is thought to be better as suggested by Andres Freund - https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx@alap3.anarazel.de.
2. Allowing a Table AM to optimize operations across multiple inserts, define its own buffering strategy and make its own flushing decisions based on its tuple storage kind and various other AM-specific factors.
3. Improve performance of various SQL commands with multi inserts for Heap AM.
The attached v19 patches might need some more detailed comments, some documentation and some specific tests ensuring that the multi inserts for heap kick in for various commands. I'm open to thoughts here.
I did some testing to see how various commands benefit from multi inserts using these new Table AMs for heap. It's not only the performance improvement these commands see; the amount of WAL generated also reduces greatly. After all, multi inserts optimize insertions by writing less WAL: IOW, one WAL record per page when multiple rows fit into a single data page, as opposed to one WAL record per row.
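As a rough back-of-the-envelope illustration (approximate numbers, not taken from the patches): a two-column (int, float) heap tuple occupies on the order of 40 bytes including its header, so roughly 200 such tuples fit on an 8 kB heap page; a single multi-insert WAL record covering that page can then stand in for about 200 per-row insert records, paying the per-record WAL overhead once instead of 200 times.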
Test case 1: 100 million rows, 2 columns (int and float)
Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 121 | 77 | 36.3 | 1.57
CREATE MATERIALIZED VIEW | 101 | 49 | 51.4 | 2.06
REFRESH MATERIALIZED VIEW | 113 | 54 | 52.2 | 2.09
ALTER TABLE (TABLE REWRITE) | 124 | 81 | 34.6 | 1.53
COPY FROM | 71 | 72 | 0 | 1
INSERT INTO ... SELECT | 117 | 62 | 47 | 1.88
LOGICAL REPLICATION APPLY | 393 | 306 | 22.1 | 1.28
Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 6.8 | 2.4 | 64.7 | 2.83
CREATE MATERIALIZED VIEW | 7.2 | 2.3 | 68 | 3.13
REFRESH MATERIALIZED VIEW | 10 | 5.1 | 49 | 1.96
ALTER TABLE (TABLE REWRITE) | 8 | 3.2 | 60 | 2.5
COPY FROM | 2.9 | 3 | 0 | 1
INSERT INTO ... SELECT | 8 | 3 | 62.5 | 2.66
LOGICAL REPLICATION APPLY | 7.5 | 2.3 | 69.3 | 3.26
Test case 2: 1 billion rows, 1 column (int)
Command | HEAD (sec) | PATCHED (sec) | Faster by % | Faster by X
------------------------------ | ---------- | ------------- | ----------- | -----------
CREATE TABLE AS | 794 | 386 | 51.38 | 2.05
CREATE MATERIALIZED VIEW | 1006 | 563 | 44.03 | 1.78
REFRESH MATERIALIZED VIEW | 977 | 603 | 38.28 | 1.62
ALTER TABLE (TABLE REWRITE) | 1189 | 714 | 39.94 | 1.66
COPY FROM | 321 | 330 | -0.02 | 0.97
INSERT INTO ... SELECT | 1084 | 586 | 45.94 | 1.84
LOGICAL REPLICATION APPLY | 3530 | 2982 | 15.52 | 1.18
Command | HEAD (WAL in GB) | PATCHED (WAL in GB) | Reduced by % | Reduced by X
------------------------------ | ---------------- | ------------------- | ------------ | -----------
CREATE TABLE AS | 60 | 12 | 80 | 5
CREATE MATERIALIZED VIEW | 60 | 12 | 80 | 5
REFRESH MATERIALIZED VIEW | 60 | 12 | 80 | 5
ALTER TABLE (TABLE REWRITE) | 123 | 31 | 60 | 2.5
COPY FROM | 12 | 12 | 0 | 1
INSERT INTO ... SELECT | 120 | 24 | 80 | 5
LOGICAL REPLICATION APPLY | 61 | 12 | 80.32 | 5
looks pretty impressive!
Pavel
Test setup:
./configure --prefix=$PWD/pg17/ --enable-tap-tests CFLAGS="-ggdb3 -O2" > install.log && make -j 8 install > install.log 2>&1 &
wal_level=logical
max_wal_size = 256GB
checkpoint_timeout = 1h
Test system is EC2 instance of type c5.4xlarge:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 7
BogoMIPS: 5999.99
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 8 MiB (8 instances)
L3: 35.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
RAM: MemTotal: 32036536 kB
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jeff Davis
Date:
On Wed, 2024-04-24 at 18:19 +0530, Bharath Rupireddy wrote:
> I added a flush callback named TableModifyBufferFlushCallback; when
> provided by callers invoked after tuples are flushed to disk from the
> buffers but before the AM frees them up. Index insertions and AFTER
> ROW INSERT triggers can be executed in this callback. See the v19-0001
> patch for how AM invokes the flush callback, and see either v19-0003
> or v19-0004 or v19-0005 for how a caller can supply the callback
> and required context to execute index insertions and AR triggers.

The flush callback takes a pointer to an array of slot pointers, and I don't think that's the right API. I think the callback should be called on each slot individually.

We shouldn't assume that a table AM stores buffered inserts as an array of slot pointers. A TupleTableSlot has a fair amount of memory overhead (64 bytes), so most AMs wouldn't want to pay that overhead for every tuple. COPY does, but that's because the number of buffered tuples is fairly small.

> >
> > 11. Deprecate the multi_insert API.
>
> I did remove both table_multi_insert and table_finish_bulk_insert in
> v19-0006.

That's OK with me. Let's leave those functions out for now.

>
> If the design, code and benefits that these new Table AMs bring to
> the table look good, I hope to see it for PG 18.

Sounds good.

Regards,
Jeff Davis
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Bharath Rupireddy
Date:
On Thu, Apr 25, 2024 at 10:11 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2024-04-24 at 18:19 +0530, Bharath Rupireddy wrote:
> > I added a flush callback named TableModifyBufferFlushCallback; when
> > provided by callers invoked after tuples are flushed to disk from the
> > buffers but before the AM frees them up. Index insertions and AFTER
> > ROW INSERT triggers can be executed in this callback. See the v19-0001
> > patch for how AM invokes the flush callback, and see either v19-0003
> > or v19-0004 or v19-0005 for how a caller can supply the callback
> > and required context to execute index insertions and AR triggers.
>
> The flush callback takes a pointer to an array of slot pointers, and I
> don't think that's the right API. I think the callback should be called
> on each slot individually.
>
> We shouldn't assume that a table AM stores buffered inserts as an array
> of slot pointers. A TupleTableSlot has a fair amount of memory overhead
> (64 bytes), so most AMs wouldn't want to pay that overhead for every
> tuple. COPY does, but that's because the number of buffered tuples is
> fairly small.

I get your point. An AM can choose to implement the buffering strategy by just storing the plain tuple rather than the tuple slots, in which case the flush callback with an array of tuple slots won't work. Therefore, I have now changed the flush callback to accept only a single tuple slot.

> > > 11. Deprecate the multi_insert API.
> >
> > I did remove both table_multi_insert and table_finish_bulk_insert in
> > v19-0006.
>
> That's OK with me. Let's leave those functions out for now.

Okay. Dropped the 0006 patch for now.

Please see the attached v20 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
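For illustration, the revised per-slot shape would be roughly the following (the callback name is from the thread; the exact signature and the field names in the loop are assumptions for this sketch):

    typedef void (*TableModifyBufferFlushCallback) (void *context,
                                                    TupleTableSlot *slot);

    /*
     * AM-side sketch: after a chunk of buffered tuples has been written to
     * the table, hand them back to the caller one at a time.  The AM is free
     * to store its buffered tuples however it likes; it only has to present
     * each one in a slot at flush time.
     */
    for (int i = 0; i < mistate->cur_slots; i++)
    {
        if (state->flush_callback != NULL)
            state->flush_callback(state->flush_context, mistate->slots[i]);
    }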
Attachment
- v20-0001-Introduce-new-Table-Access-Methods-for-single-an.patch
- v20-0002-Optimize-CTAS-CMV-RMV-and-TABLE-REWRITES-with-mu.patch
- v20-0003-Optimize-INSERT-INTO-.-SELECT-with-multi-inserts.patch
- v20-0004-Optimize-Logical-Replication-apply-with-multi-in.patch
- v20-0005-Use-new-multi-insert-Table-AM-for-COPY-FROM.patch
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Bharath Rupireddy
Date:
On Mon, Apr 29, 2024 at 11:36 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Please see the attached v20 patch set.

It looks like with the use of the new multi insert table access method (TAM) for COPY (v20-0005), pgbench regressed about 35% [1]. The reason is that the memory-based flushing decision the new TAM takes [2] differs from what COPY does today with table_multi_insert.

COPY with table_multi_insert maintains the exact size of the tuples in CopyFromState after it does the line parsing. For instance, the tuple size of a table with two integer columns is 8 (4+4) bytes here. The new TAM relies on the memory occupied by the slot's memory context, which holds the actual tuple, as an approximation of the tuple size. But this memory context size also includes the tuple header, so the size here is not just 8 (4+4) bytes but more. Because of this, the buffers get flushed sooner than that of the existing COPY with table_multi_insert, causing a regression in pgbench which uses COPY extensively.

The new TAM isn't designed to receive tuple sizes from the callers, and even if we did that, the API wouldn't look generic. Here are a couple of ideas to address this:

1. Try to get the actual tuple sizes excluding header sizes for each column in the new TAM.
2. Try not to use the new TAM for COPY, in which case table_multi_insert stays forever.
3. Try passing a flag to tell the new TAM that the caller does the flushing and no internal flushing is needed.

I haven't explored idea (1) in depth yet. If we find a way to do it, it looks to me like we'd be going backwards, since we'd need to strip off the header sizes for every column of every row. I suspect this would cost a bit more and may not solve the regression. With the eventual goal of getting rid of table_multi_insert, (2) isn't a good choice. (3) seems reasonable to implement and reduces the regression, so I did that in the attached v21 patches. A new flag TM_SKIP_INTERNAL_BUFFER_FLUSH is introduced in the v21 patch; when specified, no internal flushing is done and the caller has to flush the buffered tuples using table_modify_buffer_flush(). Check the test results [3]: HEAD 2.948 s, PATCHED 2.946 s.

v21 also adds code to maintain the tuple size for virtual tuple slots. This helps make better memory-based flushing decisions in the new TAM.

Thoughts?

[1] HEAD:
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.99 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.78 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.88 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.97 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.07 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.86 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.96 s, vacuum 0.21 s, primary keys 0.69 s).
done in 2.90 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.05 s, vacuum 0.21 s, primary keys 0.64 s).
done in 2.83 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.96 s, vacuum 0.21 s, primary keys 0.66 s).
done in 2.80 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.95 s, vacuum 0.20 s, primary keys 0.63 s).
done in 2.79 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 1.89 s, vacuum 0.21 s, primary keys 0.69 s).
done in 3.75 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.17 s, vacuum 0.32 s, primary keys 1.25 s).
done in 3.86 s (drop tables 0.00 s, create tables 0.08 s, client-side generate 2.97 s, vacuum 0.21 s, primary keys 0.59 s).
AVG done in 2.948 s

v20 PATCHED:
done in 3.94 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.12 s, vacuum 0.19 s, primary keys 0.62 s).
done in 4.04 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.22 s, vacuum 0.20 s, primary keys 0.61 s).
done in 3.98 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.16 s, vacuum 0.20 s, primary keys 0.61 s).
done in 4.04 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.16 s, vacuum 0.20 s, primary keys 0.67 s).
done in 3.98 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.16 s, vacuum 0.20 s, primary keys 0.61 s).
done in 4.00 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.17 s, vacuum 0.20 s, primary keys 0.63 s).
done in 4.43 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.24 s, vacuum 0.21 s, primary keys 0.98 s).
done in 4.16 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 3.36 s, vacuum 0.20 s, primary keys 0.59 s).
done in 3.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.83 s, vacuum 0.20 s, primary keys 0.58 s).
done in 3.67 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.84 s, vacuum 0.21 s, primary keys 0.61 s).
AVG done in 3.986 s

[2]
+ /*
+  * Memory allocated for the whole tuple is in slot's memory context, so
+  * use it keep track of the total space occupied by all buffered tuples.
+  */
+ if (TTS_SHOULDFREE(slot))
+     mistate->cur_size += MemoryContextMemAllocated(slot->tts_mcxt, false);
+
+ if (mistate->cur_slots >= HEAP_MAX_BUFFERED_SLOTS ||
+     mistate->cur_size >= HEAP_MAX_BUFFERED_BYTES)
+     heap_modify_buffer_flush(state);

[3] v21 PATCHED:
done in 2.92 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.12 s, vacuum 0.21 s, primary keys 0.59 s).
done in 2.89 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.07 s, vacuum 0.21 s, primary keys 0.61 s).
done in 2.89 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.05 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.90 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.07 s, vacuum 0.21 s, primary keys 0.62 s).
done in 2.80 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.00 s, vacuum 0.21 s, primary keys 0.59 s).
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.04 s, vacuum 0.20 s, primary keys 0.60 s).
done in 2.84 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.03 s, vacuum 0.20 s, primary keys 0.59 s).
done in 2.85 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.04 s, vacuum 0.20 s, primary keys 0.60 s).
done in 3.48 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.44 s, vacuum 0.23 s, primary keys 0.80 s).
done in 3.05 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 2.28 s, vacuum 0.21 s, primary keys 0.55 s).
AVG done in 2.946 s

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
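To make the idea concrete, a COPY-like caller using the new flag might look roughly like this (TM_SKIP_INTERNAL_BUFFER_FLUSH and the table_modify_* names are from the thread; the argument lists, limits and loop structure are assumptions for this sketch):

    /* begin with internal flushing disabled; the caller decides when to flush */
    mstate = table_modify_begin(rel, TM_SKIP_INTERNAL_BUFFER_FLUSH /* , ... */);

    for (;;)
    {
        /* ... read and parse the next input line into myslot; break at EOF ... */

        table_modify_buffer_insert(mstate, myslot);

        nbuffered++;
        bufferedBytes += tuple_len;     /* exact post-parse size, as COPY tracks today */

        if (nbuffered >= MAX_BUFFERED_TUPLES || bufferedBytes >= MAX_BUFFERED_BYTES)
        {
            table_modify_buffer_flush(mstate);
            nbuffered = 0;
            bufferedBytes = 0;
        }
    }

    table_modify_end(mstate);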
Attachment
- v21-0001-Introduce-new-Table-Access-Methods-for-single-an.patch
- v21-0002-Optimize-CTAS-CMV-RMV-and-TABLE-REWRITES-with-mu.patch
- v21-0003-Optimize-INSERT-INTO-.-SELECT-with-multi-inserts.patch
- v21-0004-Optimize-Logical-Replication-apply-with-multi-in.patch
- v21-0005-Use-new-multi-insert-Table-AM-for-COPY-FROM.patch
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Alvaro Herrera
Date:
Sorry to interject, but --

On 2024-May-15, Bharath Rupireddy wrote:

> It looks like with the use of the new multi insert table access method
> (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].

Where does this acronym "TAM" come from for "table access method"? I find it thoroughly horrible and wish we didn't use it. What's wrong with using "table AM"? It's not that much longer, much clearer and reuses our well-established acronym AM.

We don't use IAM anywhere, for example (it's always "index AM"), and I don't think we'd turn "sequence AM" into SAM either, would we?

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Bharath Rupireddy
Date:
On Wed, May 15, 2024 at 2:44 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > It looks like with the use of the new multi insert table access method
> > (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].
>
> Where does this acronym "TAM" come from for "table access method"?

Thanks for pointing it out. I used it just for discussion's sake in this response. Although a few of the previous responses from others in this thread mentioned that word, none of the patches have it added in the code. I'll make sure not to use it further in this thread if it worries anyone that another acronym is being added.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Andres Freund
Date:
Hi,

On 2024-05-15 11:14:14 +0200, Alvaro Herrera wrote:
> On 2024-May-15, Bharath Rupireddy wrote:
> > It looks like with the use of the new multi insert table access method
> > (TAM) for COPY (v20-0005), pgbench regressed about 35% [1].
>
> Where does this acronym "TAM" come from for "table access method"? I
> find it thoroughly horrible and wish we didn't use it. What's wrong
> with using "table AM"? It's not that much longer, much clearer and
> reuses our well-established acronym AM.

Strongly agreed. I don't know why I dislike TAM so much though.

Greetings,

Andres Freund
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Michael Paquier
Date:
On Wed, May 15, 2024 at 11:14:14AM +0200, Alvaro Herrera wrote:
> We don't use IAM anywhere, for example (it's always "index AM"), and I
> don't think we'd turn "sequence AM" into SAM either, would we?

SAM is not a term I've seen used for sequence AMs in the past, and I don't intend to use it. TAM is similarly strange to me, but perhaps that's just because I am used to "table AM" as a whole.

--
Michael
Attachment
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jeff Davis
Date:
On Wed, 2024-05-15 at 12:56 +0530, Bharath Rupireddy wrote:
> Because of this, the buffers get flushed sooner than that of the
> existing COPY with table_multi_insert AM causing regression in pgbench
> which uses COPY extensively.

The flushing behavior is entirely controlled by the table AM. The heap can use the same flushing logic that it did before, which is to hold 1000 tuples.

I like that it's accounting for memory, too, but it doesn't need to be overly restrictive. Why not just use work_mem? That should hold 1000 reasonably-sized tuples, plus overhead.

Even better would be if we could take into account partitioning. That might be out of scope for your current work, but it would be very useful. We could have a couple new GUCs like modify_table_buffer and modify_table_buffer_per_partition or something like that.

> 1. Try to get the actual tuple sizes excluding header sizes for each
> column in the new TAM.

I don't see the point in arbitrarily excluding the header.

> v21 also adds code to maintain tuple size for virtual tuple slots.
> This helps make better memory-based flushing decisions in the new
> TAM.

That seems wrong. We shouldn't need to change the TupleTableSlot structure for this patch.

Comments on v21:

* All callers specify TM_FLAG_MULTI_INSERTS. What's the purpose?

* The only caller that doesn't use TM_FLAG_BAS_BULKWRITE is ExecInsert(). What's the disadvantage to using a bulk insert state there?

* I'm a bit confused by TableModifyState->modify_end_callback. The AM both sets the callback and calls the callback -- why can't the code just go into the table_modify_end method?

* The code structure in table_modify_begin() (and related) is strange. Can it be simplified or am I missing something?

* Why are table_modify_state and insert_modify_buffer_flush_context globals? What if there are multiple modify nodes in a plan?

* Can you explain the design in logical rep?

Regards,
Jeff Davis
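For illustration, a work_mem-based flush test on the heap side could look roughly like this (HEAP_MAX_BUFFERED_SLOTS and heap_modify_buffer_flush appear in the patch fragment quoted upthread; using work_mem and the mistate->context field here are assumptions for this sketch, not what v21 does):

    /* flush when either the slot count or the buffered memory grows large */
    if (mistate->cur_slots >= HEAP_MAX_BUFFERED_SLOTS ||
        MemoryContextMemAllocated(mistate->context, false) >= (Size) work_mem * 1024)
        heap_modify_buffer_flush(state);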
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jeff Davis
Date:
On Wed, 2024-05-15 at 16:31 -0700, Jeff Davis wrote:
> Even better would be if we could take into account partitioning. That
> might be out of scope for your current work, but it would be very
> useful. We could have a couple new GUCs like modify_table_buffer and
> modify_table_buffer_per_partition or something like that.

To expand on this point:

For heap, the insert buffer is only 1000 tuples, which doesn't take much memory. But for an AM that does any significant reorganization of the input data, the buffer may be much larger. For insert into a partitioned table, that buffer could be multiplied across many partitions, and start to be a real concern.

We might not need table AM API changes at all here beyond what v21 offers. The ModifyTableState includes the memory context, so that gives the caller a way to know the memory consumption of a single partition's buffer. And if it needs to free the resources, it can just call modify_table_end(), and then _begin() again if more tuples hit that partition. So I believe what I'm asking for here is entirely orthogonal to the current proposal.

However, it got me thinking that we might not want to use work_mem for controlling the heap's buffer size. Each AM is going to have radically different memory needs, and may have its own (extension) GUCs to control that memory usage, so they won't honor work_mem. We could either have a separate GUC for the heap if it makes sense, or we could just hard-code a reasonable value.

Regards,
Jeff Davis
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Bharath Rupireddy
Date:
Hi,

On Thu, May 16, 2024 at 5:01 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> The flushing behavior is entirely controlled by the table AM. The heap
> can use the same flushing logic that it did before, which is to hold
> 1000 tuples.
>
> I like that it's accounting for memory, too, but it doesn't need to be
> overly restrictive. Why not just use work_mem? That should hold 1000
> reasonably-sized tuples, plus overhead.
>
> Even better would be if we could take into account partitioning. That
> might be out of scope for your current work, but it would be very
> useful. We could have a couple new GUCs like modify_table_buffer and
> modify_table_buffer_per_partition or something like that.

I disagree with inventing more GUCs. Instead, I'd vote for just holding 1000 tuples in buffers for heap AM. This not only keeps the code and the new table AM simple, but also does not cause a regression for COPY. In my testing, 1000 tuples with 1 int and 1 float column took 40000 bytes of memory (40 bytes per tuple), whereas 1000 tuples with 1 int, 1 float and 1 text column took 172000 bytes (172 bytes per tuple), which IMO isn't a big problem. Thoughts?

> > 1. Try to get the actual tuple sizes excluding header sizes for each
> > column in the new TAM.
>
> I don't see the point in arbitrarily excluding the header.
>
> > v21 also adds code to maintain tuple size for virtual tuple slots.
> > This helps make better memory-based flushing decisions in the new
> > TAM.
>
> That seems wrong. We shouldn't need to change the TupleTableSlot
> structure for this patch.

I dropped these ideas, as I went ahead with the above idea of just holding 1000 tuples in buffers for heap AM.

> Comments on v21:
>
> * All callers specify TM_FLAG_MULTI_INSERTS. What's the purpose?

Previously, the multi insert state was initialized in modify_begin, so the flag was needed to differentiate the code path. It's not needed anymore now that the lazy initialization of the multi insert state has moved to modify_buffer_insert, so I removed it.

> * The only caller that doesn't use TM_FLAG_BAS_BULKWRITE is
> ExecInsert(). What's the disadvantage to using a bulk insert state
> there?

Subsequent read queries will not find the just-inserted tuples in shared buffers, as a separate ring buffer is used with the bulk insert access strategy. Multi inserts are nothing but buffering multiple tuples plus inserting them in bulk, so using the bulk insert strategy might be worth it for INSERT INTO SELECTs too. Thoughts?

> * I'm a bit confused by TableModifyState->modify_end_callback. The AM
> both sets the callback and calls the callback -- why can't the code
> just go into the table_modify_end method?

I came up with modify_end_callback per the discussion upthread, so that modify_begin and modify_end can be used in future for UPDATE, DELETE and MERGE without any operation-specific flags to clean up the state appropriately. The operation-specific state cleanup logic can go into the modify_end_callback implementation defined by the AM.

> * The code structure in table_modify_begin() (and related) is strange.
> Can it be simplified or am I missing something?

I previously defined these new table AMs as optional, see GetTableAmRoutine(). And there was a point upthread to provide a default/fallback implementation so that insert operations don't fail on tables whose AMs don't implement the new table AMs. FWIW, the default implementation was just doing single inserts. table_modify_begin and friends need the fallback logic, which makes the code there look different from other AMs.

However, I'm now inclined to drop the idea of having a fallback implementation and let the AMs deal with it. Although it might create some friction with various non-core AM implementations, it keeps this patch simple, which I would vote for. Thoughts?

> * Why are table_modify_state and insert_modify_buffer_flush_context
> globals? What if there are multiple modify nodes in a plan?

Can you please provide a case that can generate multiple "modify nodes" in a single plan? AFAICS, multiple "modify nodes" in a plan can exist for both partitioned tables and tables that get created as part of CTEs. I disabled multi inserts for both of these cases. The way I disabled it for CTEs looks pretty naive - I just did the following. Any better suggestions here to deal with all such cases?

+    if (operation == CMD_INSERT &&
+        nodeTag(subplanstate) == T_SeqScanState)
+        canMultiInsert = true;

> * Can you explain the design in logical rep?

Multi inserts for logical replication work at the table level. In other words, all tuple inserts related to a single table within a transaction are buffered and written to the corresponding table when necessary. Whenever inserts pertaining to another table arrive, the buffered tuples related to the previous table are written out before the buffering for the new table starts. The tuples are also written out from the buffer whenever a non-INSERT operation arrives, for example UPDATE/DELETE/TRUNCATE/COMMIT etc. FWIW, pglogical has similar multi insert logic - https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_apply_heap.c#L879.

Please find the v22 patches with the above changes.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
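To illustrate the per-table buffering described above, the apply side could be structured roughly as follows (the table_modify_* names are from the patches; the static variables, helper names and argument lists are assumptions for this sketch):

    static TableModifyState *apply_mstate = NULL;
    static Oid      apply_mstate_relid = InvalidOid;

    static void
    apply_flush_buffered_inserts(void)
    {
        if (apply_mstate != NULL)
        {
            table_modify_end(apply_mstate);     /* flushes any remaining tuples */
            apply_mstate = NULL;
            apply_mstate_relid = InvalidOid;
        }
    }

    static void
    apply_buffer_insert(Relation rel, TupleTableSlot *slot)
    {
        /* inserts for a different table arrived: write out the previous buffer */
        if (apply_mstate != NULL &&
            apply_mstate_relid != RelationGetRelid(rel))
            apply_flush_buffered_inserts();

        if (apply_mstate == NULL)
        {
            apply_mstate = table_modify_begin(rel /* , flags, ... */);
            apply_mstate_relid = RelationGetRelid(rel);
        }

        table_modify_buffer_insert(apply_mstate, slot);
    }

    /*
     * apply_flush_buffered_inserts() would likewise be called before handling
     * UPDATE/DELETE/TRUNCATE and at commit, as described above.
     */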
Attachment
- v22-0001-Introduce-new-Table-AM-for-multi-inserts.patch
- v22-0002-Optimize-various-SQL-commands-with-new-multi-ins.patch
- v22-0003-Optimize-INSERT-INTO-SELECT-with-new-multi-inser.patch
- v22-0004-Optimize-Logical-Replication-Apply-with-new-mult.patch
- v22-0005-Use-new-multi-insert-table-AM-for-COPY.patch
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jingtang Zhang
Date:
Hi! Glad to see an update in this thread.

A little question about the v24 0002 patch: would it be better to move the implementation of TableModifyIsMultiInsertsSupported somewhere at the table AM level? It seems to be a common function for future use, not one specific to matviews.

---
Regards, Jingtang
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote on Sat, Oct 26, 2024 at 21:31:
Hi,
Thanks for looking into this.
On Thu, Aug 29, 2024 at 12:29 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
> I believe we need the branching in the caller anyway:
>
> 1. If there is a BEFORE row trigger with a volatile function, the
> visibility rules[1] mean that the function should see changes from all
> the rows inserted so far this command, which won't work if they are
> still in the buffer.
>
> 2. Similarly, for an INSTEAD OF row trigger, the visibility rules say
> that the function should see all previous rows inserted.
>
> 3. If there are volatile functions in the target list or WHERE clause,
> the same visibility semantics apply.
>
> 4. If there's a "RETURNING ctid" clause, we need to either come up with
> a way to return the tuples after flushing, or we need to use the
> single-tuple path. (Similarly in the future when we support UPDATE ...
> RETURNING, as Matthias pointed out.)
>
> If we need two paths in each caller anyway, it seems cleaner to just
> wrap the check for tuple_modify_buffer_insert in
> table_modify_buffer_enabled().
>
> We could perhaps use a one path and then force a batch size of one or
> something, which is an alternative, but we have to be careful not to
> introduce a regression (and it still requires a solution for #4).
I chose to branch in the caller: e.g., if there's a volatile function in the
SELECT query of REFRESH MATERIALIZED VIEW, the caller takes the
table_tuple_insert() path, otherwise the multi-insert path.
I am posting the new v24 patch set organized as follows: 0001
introduces the new table AM, 0002 optimizes CTAS, CMV and RMV, and 0003
uses the new table AM for COPY ... FROM. For now, I have dropped the
INSERT INTO ... SELECT and Logical Replication Apply patches; the idea
is to take the basic stuff forward first.
I reworked structure names, members and function names, reworded
comments, addressed review comments in the v24 patches. Please have a
look.
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jingtang Zhang
Date:
Oh, another comment on the v24-0001 patch: we are in the heap AM now, so should we use something like HEAP_INSERT_BAS_BULKWRITE instead of a table AM option, just like the other heap AM options do?

> +    if ((state->options & TABLE_INSERT_BAS_BULKWRITE) != 0)
> +        istate->bistate = GetBulkInsertState();

—
Regards, Jingtang
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jingtang Zhang
Date:
Hi~ Sorry for multiple comments in separate mails. I just found that this initialization seems redundant, since we have used palloc0:

> +    istate = (HeapInsertState *) palloc0(sizeof(HeapInsertState));
> +    istate->bistate = NULL;
> +    istate->mistate = NULL;

---
Regards, Jingtang
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Jingtang Zhang
Date:
Hi~ I did some performance tests these days, and I have some findings.

From the archive months ago, I found there were discussions about which type of TupleTableSlot to use for buffering tuples. A single-column mat view was used for evaluation, and finally the virtual one was chosen. However, when I test with a 32-column mat view, I see a regression.

Test case:

-- prepare
create table test as select
  i as id0, i + 1 as id1, i + 2 as id2, i + 3 as id3, i + 4 as id4,
  i + 5 as id5, i + 6 as id6, i + 7 as id7, i + 8 as id8, i + 9 as id9,
  i + 10 as id10, i + 11 as id11, i + 12 as id12, i + 13 as id13,
  i + 14 as id14, i + 15 as id15,
  i + 0.01 as f0, i + 0.1 as f1, i + 0.2 as f2, i + 0.3 as f3,
  i + 0.4 as f4, i + 0.5 as f5, i + 0.6 as f6, i + 0.7 as f7,
  i + 0.8 as f8, i + 0.9 as f9, i + 1.01 as f10, i + 1.1 as f11,
  i + 1.2 as f12, i + 1.3 as f13, i + 1.4 as f14, i + 1.5 as f15,
  i + 1.6 as f16
from generate_series(1,5000000) i;

-- run
create materialized view m1 as select * from test;

HEAD:
Time: 13615.542 ms (00:13.616)
Time: 13545.706 ms (00:13.546)
Time: 13578.475 ms (00:13.578)

Patched:
Time: 20112.734 ms (00:20.113)
Time: 19996.957 ms (00:19.997)
Time: 19936.871 ms (00:19.937)

I did a quick perf; the overhead seems to come from virtual tuple materialization.

HEAD:
12.29% postgres [.] pg_checksum_block
 6.33% postgres [.] GetPrivateRefCountEntry
 5.40% postgres [.] pg_comp_crc32c_sse42
 4.54% [kernel] [k] copy_user_enhanced_fast_string
 2.69% postgres [.] BufferIsValid
 1.52% postgres [.] XLogRecordAssemble

Patched:
11.75% postgres [.] tts_virtual_materialize
 8.87% postgres [.] pg_checksum_block
 8.17% postgres [.] slot_deform_heap_tuple
 8.09% postgres [.] heap_compute_data_size
 6.17% postgres [.] fill_val
 3.81% postgres [.] heap_fill_tuple
 3.37% postgres [.] tts_virtual_copyslot
 2.62% [kernel] [k] copy_user_enhanced_fast_string

Not sure if this is a universal situation.

—
Regards, Jingtang
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Daniil Davydov
Date:
Hi,

A few days ago I came up with an idea to implement multi insert optimization wherever possible. I prepared a raw patch and it showed a great performance gain (up to 4 times during INSERT ... INTO ... in the best case). Then I was very happy to find this thread. You did a great job and I want to help you bring the matter to an end.

On Thu, Oct 31, 2024 at 11:17 AM Jingtang Zhang <mrdrivingduck@gmail.com> wrote:
> I did some performance test these days, and I have some findings.
> HEAD:
> 12.29% postgres [.] pg_checksum_block
> 6.33% postgres [.] GetPrivateRefCountEntry
> 5.40% postgres [.] pg_comp_crc32c_sse42
> 4.54% [kernel] [k] copy_user_enhanced_fast_string
> 2.69% postgres [.] BufferIsValid
> 1.52% postgres [.] XLogRecordAssemble
>
> Patched:
> 11.75% postgres [.] tts_virtual_materialize
> 8.87% postgres [.] pg_checksum_block
> 8.17% postgres [.] slot_deform_heap_tuple
> 8.09% postgres [.] heap_compute_data_size
> 6.17% postgres [.] fill_val
> 3.81% postgres [.] heap_fill_tuple
> 3.37% postgres [.] tts_virtual_copyslot
> 2.62% [kernel] [k] copy_user_enhanced_fast_string

I applied the v25 patches on the master branch and made some measurements to find out what the bottleneck is in this case. The 'time' utility showed that without a patch this query runs 1.5 times slower. I also made a few flamegraphs for this test. Most of the time is spent calling these two functions: tts_virtual_copyslot and heap_form_tuple.

All tests were run in a virtual machine with these CPU characteristics:

Architecture: x86_64
CPU(s): 2
On-line CPU(s) list: 0,1
Virtualization features:
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (2 instances)
L1i: 128 KiB (2 instances)
L2: 1 MiB (2 instances)
L3: 32 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0,1

In my implementation, I used the Tuplestore functionality to store tuples. In order to avoid getting stuck in the above-mentioned functions, I combined it with the current implementation (v25 patches) and got a 10% increase in performance (for the test above). I also set up the v22 patches to compare performance (with/without tuplestore) for INSERT ... INTO ... queries (with -j 4 -c 10 parameters for pgbench), and there was also an increase in TPS (about 3-4%).

I attach a patch that adds Tuplestore to v25. What do you think about this idea?

--
Best regards,
Daniil Davydov
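For reference, the tuplestore-based buffering boils down to something like the following on the heap side (the tuplestore_* calls are the existing PostgreSQL API; the mistate fields and the batching details are assumptions for this sketch):

    /* set up the buffer once, bounded by work_mem kilobytes */
    mistate->tstore = tuplestore_begin_heap(false, false, work_mem);

    /* buffering a tuple: no long-lived TupleTableSlot is kept per tuple */
    tuplestore_puttupleslot(mistate->tstore, slot);

    /* flushing: read the tuples back, batch them, and heap-insert the batch */
    while (tuplestore_gettupleslot(mistate->tstore, true, false, scanslot))
    {
        /* ... accumulate scanslot into a batch and pass it to heap_multi_insert() ... */
    }
    tuplestore_clear(mistate->tstore);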
Attachment
Re: Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM
From
Daniil Davydov
Date:
Hi,

Recently I took more careful measurements of the performance. I compared three branches with each other: HEAD, Patched and Patched with tuplestore. Here are the results:

1) Test case: matview creation test attached in the email from Jingtang Zhang.
10 measurements for each branch. Result in wall clock execution time:
HEAD                     30.532 +- 0.59 seconds elapsed
Patched                  20.454 +- 0.114 seconds elapsed
Patched with tuplestore  19.653 +- 0.111 seconds elapsed

2)
-- init.sql
drop table test_insert;
vacuum;
checkpoint;
create table test_insert(i int, f float);

-- iowrite.sql
insert into test_insert select g, (g % 100) / 100.0 from generate_series(1, 1000000) as g;

Test case: pgbench -f iowrite.sql -n -j 4 -c 10 -T 40
5 measurements for each branch. Result in tps:
HEAD                     1.025 +- 0.009
Patched                  2.923 +- 0.032
Patched with tuplestore  2.987 +- 0.011

P.S. I cannot find a commitfest entry for this patch. Should we add it there?

--
Best regards,
Daniil Davydov