Re: New Table Access Methods for Multi and Single Inserts - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: New Table Access Methods for Multi and Single Inserts
Date
Msg-id CALj2ACVsiAZMsP8p5MPg6SSEtoMFFaiAa6j2AFtEQJDhfbgs3Q@mail.gmail.com
Whole thread Raw
In response to Re: New Table Access Methods for Multi and Single Inserts  (Luc Vlaming <luc@swarm64.com>)
Responses Re: New Table Access Methods for Multi and Single Inserts  (Luc Vlaming <luc@swarm64.com>)
List pgsql-hackers
On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc@swarm64.com> wrote:
>  > table AM patch [2] be reviewed further?
> As to the patches themselves:
>
> I think the API is a huge step forward! I assume that we want to have a
> single-insert API like heap_insert_v2 so that we can encode the
> knowledge that there will just be a single insert coming and likely a
> commit afterwards?
>
> Reason I'm asking is that I quite liked the heap_insert_begin parameter
> is_multi, which could even be turned into a "expected_rowcount" of the
> amount of rows expected to be commited in the transaction (e.g. single,
> several, thousands/stream).
> If we were to make the API based on expected rowcounts, the whole
> heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
> single function heap_insert, as the knowledge about buffering of the
> slots is then already stored in the TableInsertState, creating an API like:
>
> // expectedRows: -1 = streaming, otherwise expected rowcount.
> TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
> options, int expectedRows);
> heap_insert(TableInsertState *state, TupleTableSlot *slot);
>
> Do you think that's a good idea?

IIUC, your suggestion is to use expectedRows and move the multi insert implementation heap_multi_insert_v2 to heap_insert_v2. If that's correct, so heap_insert_v2 will look something like this:

heap_insert_v2()
{
    if (single_insert)
      //do single insertion work, the code in existing heap_insert_v2 comes here
   else
      //do multi insertion work, the code in existing heap_multi_insert_v2 comes here
}

I don't see any problem in combining single and multi insert APIs into one. Having said that, will the APIs be cleaner then? Isn't it going to be confusing if a single heap_insert_v2 API does both the works? With the existing separate APIs, for single insertion, the sequence of the API can be like begin, insert_v2, end and for multi inserts it's like begin, multi_insert_v2, flush, end. I prefer to have a separate multi insert API so that it will make the code look readable.

Thoughts?

> Two smaller things I'm wondering:
> - the clear_mi_slots; why is this not in the HeapMultiInsertState? the
> slots themselves are declared there?

Firstly, we need to have the buffered slots sometimes(please have a look at the comments in TableInsertState structure) outside the multi_insert API. And we need to have cleared the previously flushed slots before we start buffering in heap_multi_insert_v2(). I can remove the clear_mi_slots flag altogether and do as follows: I will not set mistate->cur_slots to 0 in heap_multi_insert_flush after the flush, I will only set state->flushed to true. In heap_multi_insert_v2,

void
heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
    TupleTableSlot  *batchslot;
    HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
    Size sz;

    Assert(mistate && mistate->slots);

    /* if the slots are flushed previously then clear them off before using them again. */
    if (state->flushed)
    {
        int i;

        for (i = 0; i < mistate->cur_slots; i++)
            ExecClearTuple(mistate->slots[i]);

        mistate->cur_slots = 0;
        state->flushed = false
    }


    if (mistate->slots[mistate->cur_slots] == NULL)
        mistate->slots[mistate->cur_slots] =
                                    table_slot_create(state->rel, NULL);

    batchslot = mistate->slots[mistate->cur_slots];

    ExecCopySlot(batchslot, slot);

Thoughts?

> Also, why do we want to do ExecClearTuple() anyway? Isn't
> it good enough that the next call to ExecCopySlot will effectively clear
> it out?

For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the slot before copying. But, for buffer heap slots, the tts_buffer_heap_copyslot does not always clear the destination slot, see below. If we fall into else condition, we might get some issues. And also note that, once the slot is cleared in ExecClearTuple, it will not be cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be false. That is why, let's have ExecClearTuple as is.

    /*
     * If the source slot is of a different kind, or is a buffer slot that has
     * been materialized / is virtual, make a new copy of the tuple. Otherwise
     * make a new reference to the in-buffer tuple.
     */
    if (dstslot->tts_ops != srcslot->tts_ops ||
        TTS_SHOULDFREE(srcslot) ||
        !bsrcslot->base.tuple)
    {
        MemoryContext oldContext;

        ExecClearTuple(dstslot);
    }
    else
    {
        Assert(BufferIsValid(bsrcslot->buffer));

        tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
                                    bsrcslot->buffer, false);

> - flushed -> why is this a stored boolean? isn't this indirectly encoded
> by cur_slots/cur_size == 0?

Note that cur_slots is in HeapMultiInsertState and outside of the new APIs i.e. in TableInsertState, mistate is a void pointer, and we can't really access the cur_slots. I mean, we can access but we need to be dereferencing using the tableam kind. Instead of  doing all of that, to keep the API cleaner, I chose to have a boolean in the TableInsertState which we can see and use outside of the new APIs. Hope that's fine.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Single transaction in the tablesync worker?
Next
From: Dilip Kumar
Date:
Subject: Re: Parallel Inserts in CREATE TABLE AS