Re: Custom table AMs need to include heapam.h because of BulkInsertState - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Custom table AMs need to include heapam.h because of BulkInsertState
Date
Msg-id CA+TgmoYK=6BpxiJ0tN-p9wtH0BTAfbdxzHhwou0mdud4+BkYuQ@mail.gmail.com
Whole thread Raw
In response to Re: Custom table AMs need to include heapam.h because ofBulkInsertState  (Andres Freund <andres@anarazel.de>)
Responses Re: Custom table AMs need to include heapam.h because ofBulkInsertState
List pgsql-hackers
On Sat, Jun 1, 2019 at 3:20 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd like to think that the best way to deal with that and reduce the
> > confusion would be to move anything related to bulk inserts into their
> > own header/file, meaning the following set:
> > - ReleaseBulkInsertStatePin
> > - GetBulkInsertState
> > - FreeBulkInsertState
> > There is the argument that we could also move that part into tableam.h
> > itself though as some of the rather generic table-related callbacks,
> > but that seems grotty.  So I think that we could just move that stuff
> > as backend/access/common/bulkinsert.c.
>
> Yea, I think we should do that at some point. But I'm not sure this is
> the right design. Bulk insert probably needs to rather be something
> that's allocated inside the AM.

As far as I can see, any on-disk, row-oriented, block-based AM is
likely to want the same implementation as the heap.  Column stores
might want to pin multiple buffers, and an in-memory AM might have a
completely different set of requirements, but something like zheap
really has no reason to depart from what the heap does.  I think it's
really important that new table AMs not only have the option to do
something different than the heap in any particular area, but that
they also have the option to do the SAME thing as the heap without
having to duplicate a bunch of code.  So I think it would be
reasonable to start by doing some pure code movement here, along the
lines proposed by Michael -- not sure if src/backend/access/common is
right or if it should be src/backend/access/table -- and then add the
abstraction afterwards.  Worth noting is ReadBufferBI() also needs
moving and is a actually a bigger problem than the functions that
Michael mentioned, because the other functions are accessible if
you're willing to stoop to including heap-specific headers, but that
function is static and you'll have to just copy-and-paste it.  Uggh.

Here's a draft design for adding some abstraction, roughly modeled on
the abstraction Andres added for TupleTableSlots:

1. a BulkInsertState becomes a struct whose only member is a pointer
to const BulkInsertStateOps *const ops

2. that structure has a member for each defined operation on a BulkInsertState:

void (*free)(BulkInsertState *);
void (*release_pin)(BulkInsertState *); // maybe rename to make it more generic

3. table AM gets a new member BulkInsertState
*(*create_bistate)(Relation Rel) and a corresponding function
table_create_bistate(), analogous to table_create_slot(), which can
call the constructor function for the appropriate type of
BulkInsertState and return the result

4. each type of BulkInsertState has its own functions for making use
of it, akin to ReadBufferBI.  That particular function signature is
only likely to be correct for something that does more-or-less what
the existing type of BulkInsertState does; if you're using a
column-store that pins multiple buffers or something, you'll need your
own code path.  But that's OK, because ReadBufferBI or whatever other
functions you have are only going to get called from AM-specific code,
which will know what type of BulkInsertState they have got, because
they are in control of which kind of BulkInsertState gets created for
their relations as per point #4, so they can just call the right
functions.

5. The current implementation of BulkInsertState gets renamed to
BlockBulkInsertState (or something else) and is used by heap and any
AMs that like it.

I see Michael's point about the relationship between
finish_bulk_insert() and the BulkInsertState, and maybe if we could
figure that out we could avoid the need for a BulkInsertState to have
a free method (or maybe any methods at all, in which case it could
just be an opaque struct, like a Node). However, it looks to me as
though copy.c can create a bunch of BulkInsertStates but only call
finish_bulk_insert() once, so unless that's a bug in need of fixing I
don't quite see how to make that approach work.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Custom table AMs need to include heapam.h because of BulkInsertState
Next
From: Tomas Vondra
Date:
Subject: Re: Avoiding hash join batch explosions with extreme skew and weirdstats