Re: New Table Access Methods for Multi and Single Inserts - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: New Table Access Methods for Multi and Single Inserts
Date
Msg-id 8633171cb034aafc260fdf37df04b6c779aa1e2f.camel@j-davis.com
Whole thread Raw
In response to Re: New Table Access Methods for Multi and Single Inserts  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: New Table Access Methods for Multi and Single Inserts  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote:
> if (table_modify_buffer_insert() is defined)
>    table_modify_buffer_insert(...);
> else
> {
>   myState->bistate = GetBulkInsertState();
>   table_tuple_insert(...);
> }

We can't alloc/free the bulk insert state for every insert call. I see
two options:

* Each caller needs to support two code paths: if the buffered insert
APIs are defined, then use those; otherwise the caller needs to manage
the bulk insert state itself and call the plain insert API.

* Have default implementation for the new API methods, so that the
default for the begin method would allocate the bulk insert state, and
the default for the buffered insert method would be to call plain
insert using the bulk insert state.

I'd prefer the latter, at least in the long term. But I haven't really
thought through the details, so perhaps we'd need to use the former.

> >
> > After we have these new APIs fully in place and used by COPY, what
> > will
> > happen to those other APIs? Will they be deprecated or will there
> > be a
> > reason to keep them?
>
> Deprecated perhaps?

Including Alexander on this thread, because he's making changes to the
multi-insert API. We need some consensus on where we are going with
these APIs before we make more changes, and what incremental steps make
sense in v17.

Here's where I think this API should go:

1. Have table_modify_begin/end and table_modify_buffer_insert, like
those that are implemented in your patch.

2. Add some kind of flush callback that will be called either while the
tuples are being flushed or after the tuples are flushed (but before
they are freed by the AM). (Aside: do we need to call it while the
tuples are being flushed to get the right visibility semantics for
after-row triggers?)

3. Add table_modify_buffer_{update|delete} APIs.

4. Some kind of API tweaks to help manage memory when modifying
pertitioned tables, so that the buffering doesn't get out of control.
Perhaps just reporting memory usage and allowing the caller to force
flushes would be enough.

5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is
fairly straightforward, I believe, and handled by your patch. Indexes
are (re)built afterward, and no triggers are possible.

6. Use these new methods for CREATE TABLE ... AS. This is fairly
straightforward, I believe, and handled by your patch. No indexes or
triggers are possible.

7. Use these new methods for COPY. We have to be careful to avoid
regressions for the heap method, because it's already managing its own
buffers. If the AM manages the buffering, then it may require
additional copying of slots, which could be a disadvantage. To solve
this, we may need some minor API tweaks to avoid copying when the
caller guarantees that the memory will not be freed to early, or
perhaps expose the AM's memory context to copyfrom.c. Another thing to
consider is that the buffering in copyfrom.c is also used for FDWs, so
that buffering code path needs to be preserved in copyfrom.c even if
not used for AMs.

8. Use these new methods for INSERT INTO ... SELECT. One potential
challenge here is that execution nodes are not always run to
completion, so we need to be sure that the flush isn't forgotten in
that case.

9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use
the buffer_insert/update/delete APIs; we don't need a separate merge
method. This probably requires that the AM maintain 3 separate buffers
to distinguish different kinds of changes at flush time (obviously
these can be initialized lazily to avoid overhead when not being used).

10. Use these new methods for logical apply.

11. Deprecate the multi_insert API.

Thoughts on this plan? Does your patch make sense in v17 as a stepping
stone, or should we try to make all of these API changes together in
v18?

Also, a sample AM code would be a huge benefit here. Writing a real AM
is hard, but perhaps we can at least have an example one to demonstrate
how to use these APIs?

Regards,
    Jeff Davis




pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: WIP Incremental JSON Parser
Next
From: Tom Lane
Date:
Subject: Re: Fixing backslash dot for COPY FROM...CSV