Re: Multi Inserts in CREATE TABLE AS - revived patch - Mailing list pgsql-hackers

From Luc Vlaming
Subject Re: Multi Inserts in CREATE TABLE AS - revived patch
Date
Msg-id 4eee0730-f6ec-e72d-3477-561643f4b327@swarm64.com
In response to Re: Multi Inserts in CREATE TABLE AS - revived patch  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Multi Inserts in CREATE TABLE AS - revived patch  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On 26-11-2020 14:45, Bharath Rupireddy wrote:
> On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming <luc@swarm64.com> wrote:
>>
>> On 26-11-2020 12:36, Bharath Rupireddy wrote:
>>> Few things:
>>>
>>> IIUC Andres mentioned similar kinds of APIs earlier in [1].
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de
>>>
>>> I would like to add some more info to one of the APIs:
>>>
>>> typedef struct MultiInsertStateData
>>> {
>>>       MemoryContext         micontext; /* A temporary memory context for
>>> multi insert. */
>>>       BulkInsertStateData *bistate;   /* Bulk insert state. */
>>>       TupleTableSlot      **mislots; /* Array of buffered slots. */
>>>       uint32              nslots; /* Total number of buffered slots. */
>>>       int64              nbytes; /* Flush buffers if the total tuple size
>>>   >= nbytes. */
>>>       int32              nused; /* Number of current buffered slots for a
>>> multi insert batch. */
>>>       int64              nsize; /* Total tuple size for a multi insert
>>> batch. */
>>> } MultiInsertStateData;
>>>
>>> /* Creates a temporary memory context, allocates the
>>> MultiInsertStateData, BulkInsertStateData and initializes other members. */
>>>       void        (*begin_multi_insert) (Relation rel,
>>> MultiInsertStateData **mistate, uint32 nslots, uint64 nbytes);
>>>
>>> /* Buffers the input slot into mistate slots, computes the size of the
>>> tuple, and adds it total buffer tuple size, if this size crosses
>>> mistate->nbytes, flush the buffered tuples into table. For heapam,
>>> existing heap_multi_insert can be used. Once the buffer is flushed, then
>>> the micontext can be reset and buffered slots can be cleared. *If nbytes
>>> i.e. total tuple size of the batch is not given, tuple size is not
>>> calculated, tuples are buffered until all the nslots are filled and then
>>> flushed.* */
>>>       void        (*do_multi_insert) (Relation rel, MultiInsertStateData
>>> *mistate, struct TupleTableSlot *slot, CommandId cid, int options);
>>>
>>> /* Flush the buffered tuples if any. For heapam, existing
>>> heap_multi_insert can be used. Deletes temporary memory context and
>>> deallocates mistate. */
>>>       void        (*end_multi_insert) (Relation rel, MultiInsertStateData
>>> *mistate, CommandId cid, int options);
>>
>> Looks all good to me, except for the nbytes part.
>> Could you explain what use case that supports? IMHO the tableam can
>> best decide itself when it's time to flush, based on its own
>> implementation, which e.g. considers how many pages to flush at a time.
>> This would also mean that most of the fields of MultiInsertStateData
>> can be private, as each tableam would return a derivative of that
>> struct (like with the destreceivers).
>>
> 
> nbytes is basically there to support the following case: say the number
> of tuples to buffer is 1000, and all the tuples are toasted with sizes
> of a few hundred MB or even GB each. Do we really want to wait until
> 1000 tuples are buffered, in which case a single query would hold
> 1000 * toasted tuple size in memory? Having a memory limit gives us
> flexibility here, and whether to use it or not is up to the table AM
> implementation. Also, the existing copy code (since it knows the tuple
> size after parsing the input data) uses this mechanism to decide when
> to flush.
> 
> If nbytes is not used by a table AM, then the multi insert can simply
> wait until nslots tuples are buffered, however much memory they occupy.
> 
> IMO, we can retain nbytes for now to decide when to flush. Thoughts?

I'm very sorry, I had not realized at all that the toasted data would be 
kept in memory until written out. I guess I'm not familiar enough with 
that part yet. I assumed this would be toasted beforehand and be tableam 
agnostic, and that any decision from the tableam to flush would happen 
well before a lot of memory had accumulated, which is a bit naive in 
hindsight.
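
Just to make sure I understand the intended nbytes behaviour, here is a 
minimal sketch of how do_multi_insert() could make the flush decision, 
using the field names from the proposed MultiInsertStateData above. 
GetTupleSize(), heap_multi_insert_flush() and heapam_do_multi_insert() 
are made-up names, and the slots are assumed to have been created in 
begin_multi_insert():

/*
 * Sketch only: buffer the slot and flush either when all nslots are used
 * or when the optional byte limit (nbytes) is reached.
 */
static void
heapam_do_multi_insert(Relation rel, MultiInsertStateData *mistate,
                       TupleTableSlot *slot, CommandId cid, int options)
{
    TupleTableSlot *batchslot = mistate->mislots[mistate->nused];

    /* Copy the caller's slot so the tuple survives until the flush. */
    ExecCopySlot(batchslot, slot);
    mistate->nused++;

    /* Only track tuple sizes when a byte limit was requested. */
    if (mistate->nbytes > 0)
        mistate->nsize += GetTupleSize(batchslot);          /* hypothetical */

    if (mistate->nused == mistate->nslots ||
        (mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes))
        heap_multi_insert_flush(rel, mistate, cid, options); /* hypothetical */
}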

> 
> I wonder how the do_multi_insert() API can decide when to flush based
> on the number of pages. Do we need to pass the maximum number of pages
> the buffered tuples may occupy, and track how many pages the currently
> buffered tuples occupy, to decide when to flush? Or is that something
> the existing table AM infrastructure already supports? And if we use
> the number of pages to decide when to flush, how well does that work
> with parallel inserts?
> 

I was assuming each tableam would use its own logic, based on its needs 
and the tradeoffs a storage engine might want to provide. That does not 
mean it should not consider outside parameters, like the aforementioned 
memory usage.
It does imply that each tableam implements its own mechanism for 
tracking how much has accumulated and deciding when to flush, because 
different tableams might track different statistics. IMHO, given that 
each tableam would want to implement its own logic for storing a slot 
into a page anyway, the logic for tracking these statistics seemed minor 
to me. Maybe I missed some parts that should be extracted into a generic 
interface, however?
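
To illustrate what I mean by keeping the fields private (a sketch only; 
the struct and field names below are made up): a tableam could embed the 
generic state in its own struct, much like the DestReceiver 
implementations do, and keep its flush statistics there:

typedef struct HeapMultiInsertState
{
    MultiInsertStateData common;    /* generic part, kept as first member */

    /* heapam-private bookkeeping used to decide when to flush */
    BlockNumber pages_buffered;     /* rough estimate of pages the batch fills */
    BlockNumber pages_flush_limit;  /* e.g. flush after ~64 pages */
} HeapMultiInsertState;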

Some examples of why a tableam could decide on its own when to flush:
- the current heap implementation could accumulate a few pages (say up 
to 64), thereby limiting the number of calls to write() and the 
accompanying blocking/context switches (see the sketch after this 
list). This would also make the writes more sequential per process, 
which I presume helps with flushing, similar to how the sequential scan 
was optimized to process a consecutive set of blocks per worker (see 
table_block_parallelscan_nextpage).
- something like zheap could accumulate data based on the number of 
columns, so that a page with column data is completely filled, thereby 
limiting write amplification.
- something implementing an LSM storage engine might accumulate a full 
in-memory level before flushing it out.
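
For the first example, the flush check could be as simple as the sketch 
below, continuing the made-up HeapMultiInsertState from above 
(heap_multi_insert_should_flush is also a made-up name):

/*
 * Sketch only: flush once the buffered tuples would roughly fill
 * pages_flush_limit heap pages.
 */
static bool
heap_multi_insert_should_flush(HeapMultiInsertState *hstate)
{
    /* Rough estimate: assume the buffered bytes pack into BLCKSZ-sized pages. */
    hstate->pages_buffered =
        (hstate->common.nsize + BLCKSZ - 1) / BLCKSZ;

    return hstate->pages_buffered >= hstate->pages_flush_limit;
}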

>>
>> One thing I'm wondering is in which memory context the slots end up
>> being allocated. I'd assume we would want to keep the slots around
>> between flushes. If they are in the temporary context this might prove
>> problematic however?
>>
> 
> I should not have used the word temporary; it actually is not
> temporary. This memory context is created in begin_multi_insert(), all
> the buffered tuples are copied using it, and it is reset at the end of
> each flush and then reused. It gets destroyed at the end, in
> end_multi_insert(). I think we should do the same in the new APIs'
> implementation.
> 
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
> 

Okay. But in which context are the slots themselves allocated then? 
Because if we allocate the slots themselves in that context and reset it 
with each flush, we would also have to re-allocate the slots on every 
flush, which seems wasteful to me?
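
What I had in mind (a sketch, assuming the begin_multi_insert signature 
from the proposal; none of this is from the patch) is that the state and 
the slots live in the caller's long-lived context, and only the 
per-batch context gets reset on flush, so the slots survive:

static void
heapam_begin_multi_insert(Relation rel, MultiInsertStateData **mistate,
                          uint32 nslots, uint64 nbytes)
{
    MultiInsertStateData *state;

    /* State, slot array and slots live in the caller's (long-lived) context. */
    state = palloc0(sizeof(MultiInsertStateData));
    state->mislots = palloc0(nslots * sizeof(TupleTableSlot *));
    state->nslots = nslots;
    state->nbytes = nbytes;
    state->bistate = GetBulkInsertState();

    for (uint32 i = 0; i < nslots; i++)
        state->mislots[i] = table_slot_create(rel, NULL);

    /* Per-batch context: reset after every flush; the slots are NOT in here. */
    state->micontext = AllocSetContextCreate(CurrentMemoryContext,
                                             "multi insert batch data",
                                             ALLOCSET_DEFAULT_SIZES);

    *mistate = state;
}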

Regards,
Luc
Swarm64


