Thread: Status of the table access method work
Hi,

In this email I want to give a brief status update of the table access
method work - I assume that most of you sensibly haven't followed it
into all nooks and crannies.

I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.

I think what's in v12 - I don't know of any non-cleanup / bugfix work
pending for 12 - is a pretty reasonable initial set of features. It
allows reimplementing a heap-like storage without any core modifications
(except WAL logging, see below); it is not sufficient to implement a
good index-oriented table AM. It does not allow storing the catalog in a
non-heap table.

The tableam interface itself doesn't care that much about how the AM
internally stores data. Most of the API (sequential scans, index
lookups, insert/update/delete) doesn't know about blocks, and only
indirectly & optionally about buffers (via BulkInsertState). There are a
few callbacks / functions that do care about blocks, because it's not
clear, or would have been too much work, to remove the dependency.
Currently these are:

- ANALYZE integration - currently the sampling logic is tied to blocks.

- index build range scans - the range is defined as blocks

- planner relation size estimate - but that could trivially just be
  filled with size-in-bytes / BLCKSZ in the callback (see the sketch
  below).

- the (optional) bitmap heap scan API - that's fairly intrinsically
  block based. An AM could just internally subdivide TIDs in a different
  way, but I don't think a bitmap scan like we have would e.g. make a
  lot of sense for an index oriented table without any sort of stable
  tid.

- the sample scan API - tsmapi.h is block based, so the tableam.h API is
  as well.

I think none of these are limiting in a particularly bad way.

The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier). One can implement scans, and inserts into
index-less tables, without providing that, but no updates, deletes etc.
One reason for that is that it'd just have required more changes to the
executor etc. to allow for wider identifiers, but the primary reason is
that indexes currently simply don't support anything else. I think this
is, by far, the biggest limitation of the API. If one e.g. wanted to
implement a practical index-organized table, the 6 byte limitation
obviously would become a limitation very quickly.

I suspect that we're going to want to get rid of that limitation in
indexes before long for other reasons too, to allow global indexes
(which'd need to encode the partition somewhere).

With regards to storing the rows themselves, the second biggest
limitation is a limitation that is not actually a part of tableam
itself: WAL. Many tableams would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc., it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP is probably much less of a problem). I
don't know what the best approach here is - allowing "well known"
extensions to register rmgr entries would be the easiest solution, but
it's certainly a bit crummy.

Currently there's some, fairly minor, requirement that TIDs are actually
unique when not using a snapshot qualifier. That's currently only
relevant for GetTupleForTrigger(), AfterTriggerSaveEvent() and
EvalPlanQualFetchRowMarks(), which use SnapshotAny. That prevents AMs
from implementing in-place updates (thus a problem e.g. for zheap). I've
a patch that fixes that, but it's too hacky for v12 - there's not always
a convenient snapshot to fetch a row (e.g. in GetTupleForTrigger() after
EPQ the row isn't visible to es_snapshot).

A second set of limitations is around making more of tableam optional.
Right now it e.g. is not possible to have an AM that doesn't implement
insert/update/delete. Obviously an AM can just throw an error in the
relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure the error is consistently
thrown, rather than only when e.g. an UPDATE actually finds a row to
update).

Currently foreign keys are allowed between tables of different types of
AM. I am wondering whether we ought to allow AMs to forbid being
referenced. If e.g. an AM has lower consistency guarantees than the AM
of the table referencing it, it might be preferable to forbid that.
OTOH, I guess such an AM could just require UNLOGGED to be used.

Another restriction is actually related to UNLOGGED - currently the
UNLOGGED processing after crashes works by recognizing init forks by
file name. But what if e.g. the storage isn't inside postgres files? Not
sure if we actually can do anything good about that.

The last issue I know about is that nodeBitmapHeapscan.c and
nodeIndexOnlyscan.c currently directly access the visibilitymap. Which
means that if an AM doesn't use the VM, it is never going to use the
optimized path. And conversely, if the AM uses the VM, it needs to
internally map tids in a way compatible with heap. I strongly suspect
that we're going to have to fix this quite soon.

It'd be a pretty significant amount of work to allow storing catalogs in
a non-heap table. One difficulty is that there's just a lot of direct
access to catalogs via heapam.h APIs - while a significant amount of
work to "fix" that, it's probably not very hard for each individual
site. There are a few places that rely on heap internals (checking xmin
for invalidation and the like). I think the biggest issue however would
be the catalog bootstrapping - to be able to read pg_am, we obviously
need to go through relcache.c's bootstrapping, and that only works
because we hardcode what those tables look like. I personally don't
think it's a particularly important issue to work on, nor am I convinced
that there'd be buy-in to make the necessary extensive changes.

Greetings,

Andres Freund
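To make the size-estimate bullet above concrete: a minimal sketch of how
a non-block AM might fill v12's relation_estimate_size callback, just
reporting size-in-bytes / BLCKSZ as "pages". The my_am_total_bytes() and
my_am_tuple_count() helpers are hypothetical AM internals, not existing
functions.

static void
myam_estimate_size(Relation rel, int32 *attr_widths,
                   BlockNumber *pages, double *tuples, double *allvisfrac)
{
    uint64      bytes = my_am_total_bytes(rel);     /* hypothetical */

    /* fake a block count, as suggested above */
    *pages = (BlockNumber) (bytes / BLCKSZ) + 1;
    *tuples = my_am_tuple_count(rel);               /* hypothetical */
    *allvisfrac = 0;    /* this AM has no visibility map */
}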
On 05/04/2019 23:25, Andres Freund wrote:
> I think what's in v12 - I don't know of any non-cleanup / bugfix work
> pending for 12 - is a pretty reasonable initial set of features.

Hooray!

> - the (optional) bitmap heap scan API - that's fairly intrinsically
> block based. An AM could just internally subdivide TIDs in a different
> way, but I don't think a bitmap scan like we have would e.g. make a
> lot of sense for an index oriented table without any sort of stable
> tid.

If an AM doesn't implement the bitmap heap scan API, what happens?
Bitmap scans are disabled?

Even if an AM isn't block-oriented, the bitmap heap scan API still makes
sense as long as there's some correlation between TIDs and physical
location. The only really broken thing about that currently is the
prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with
the TIDs' block numbers. It would be pretty straightforward to wrap that
in a callback, so that the AM could do something different. Or move even
more of the logic to the AM, so that the AM would get the whole
TIDBitmap in table_beginscan_bm(). It could then implement the fetching
and prefetching as it sees fit.

I don't think it's urgent, though. We can cross that bridge when we get
there, with the first AM that needs that flexibility.

> The most constraining factor for storage, I think, is that currently the
> API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
> byte tuple identifier).

I think 48 bits would be just about enough, but it's even more limited
than you might think at the moment. There are a few places that assume
that the offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0,
because that makes the ItemPointer invalid, which is inconvenient if you
tried to use an ItemPointer as just an arbitrary 48-bit integer.

- Heikki
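To illustrate the constraints Heikki lists: a sketch of using
ItemPointers as (almost) arbitrary row numbers while keeping the offset
number >= 1 and <= MaxHeapTuplesPerPage. rownum_to_tid() and
tid_to_rownum() are hypothetical helpers, not existing PostgreSQL
functions; overflow checking is omitted.

#include "access/htup_details.h"    /* MaxHeapTuplesPerPage */
#include "storage/itemptr.h"

static void
rownum_to_tid(uint64 rownum, ItemPointer tid)
{
    BlockNumber blkno = (BlockNumber) (rownum / MaxHeapTuplesPerPage);
    OffsetNumber offnum = (OffsetNumber) (rownum % MaxHeapTuplesPerPage) + 1;

    ItemPointerSet(tid, blkno, offnum);     /* offset 0 would be invalid */
}

static uint64
tid_to_rownum(ItemPointer tid)
{
    return (uint64) ItemPointerGetBlockNumber(tid) * MaxHeapTuplesPerPage +
        (ItemPointerGetOffsetNumber(tid) - 1);
}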
Hi,

On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:
> On 05/04/2019 23:25, Andres Freund wrote:
> > - the (optional) bitmap heap scan API - that's fairly intrinsically
> > block based. An AM could just internally subdivide TIDs in a different
> > way, but I don't think a bitmap scan like we have would e.g. make a
> > lot of sense for an index oriented table without any sort of stable
> > tid.
>
> If an AM doesn't implement the bitmap heap scan API, what happens?
> Bitmap scans are disabled?

Yea, the planner doesn't consider them. It just masks the index's
amhasgetbitmap. Seems to be the most reasonable thing to do?

> Even if an AM isn't block-oriented, the bitmap heap scan API still makes
> sense as long as there's some correlation between TIDs and physical
> location.

Yea, it could be a non-linear mapping. But I'm honestly not sure how
many non-block-oriented AMs with such a correlation there are - I mean
you're not going to have that in, say, an IOT. And it'd be trivial to
just "fake" a block mapping for an in-memory AM.

> The only really broken thing about that currently is the
> prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with
> the TIDs' block numbers. It would be pretty straightforward to wrap that
> in a callback, so that the AM could do something different.

That, and the VM_ALL_VISIBLE() checks both in nodeBitmapHeapscan.c and
nodeIndexonlyscan.c.

> Or move even more of the logic to the AM, so that the AM would get the
> whole TIDBitmap in table_beginscan_bm(). It could then implement the
> fetching and prefetching as it sees fit.
>
> I don't think it's urgent, though. We can cross that bridge when we get
> there, with the first AM that needs that flexibility.

Yea, it seemed nontrivial (not really hard, just not obvious), and the
implied code duplication scared me away.

> > The most constraining factor for storage, I think, is that currently the
> > API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
> > byte tuple identifier).
>
> I think 48 bits would be just about enough

I don't think that's really true. Consider e.g. implementing an
index-oriented table - there's no way you can efficiently implement one
with that small a key. You basically need a helper index just to have
efficient and small enough tids. And given that we're also going to need
wider tids for global indexes, I suspect we're just going to have to
bite into the sour apple and make tids variable width.

> , but it's even more limited than
> you might think at the moment. There are a few places that assume that
> the offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
> MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0,
> because that makes the ItemPointer invalid, which is inconvenient if you
> tried to use an ItemPointer as just an arbitrary 48-bit integer.

Good point.

Thanks for looking (and playing, in the other thread)!

Greetings,

Andres Freund
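For reference, the prefetch callback being discussed could be as small
as one optional member in TableAmRoutine. This is a purely hypothetical
shape, not part of the committed v12 API:

/*
 * Hypothetical: called by nodeBitmapHeapscan.c instead of invoking
 * PrefetchBuffer() directly; NULL would mean "no prefetching for this
 * AM".
 */
void        (*scan_bitmap_prefetch) (TableScanDesc scan,
                                     BlockNumber blockno);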
On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> In this email I want to give a brief status update of the table access
> method work - I assume that most of you sensibly haven't followed it
> into all nooks and crannies.
>
> I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
> others that collaborated on making tableam happen. It was/is a huge
> project.
A big thank you, Andres, for your enormous efforts on this patch.
Without your involvement, this patch couldn't have made it into v12.
> With regards to storing the rows themselves, the second biggest
> limitation is a limitation that is not actually a part of tableam
> itself: WAL. Many tableams would want to use WAL, but we only have
> extensible WAL as part of generic_xlog.h. While that's useful to allow
> prototyping etc., it's imo not efficient enough to build a competitive
> storage engine for OLTP (OLAP is probably much less of a problem). I
> don't know what the best approach here is - allowing "well known"
> extensions to register rmgr entries would be the easiest solution, but
> it's certainly a bit crummy.
I had the same doubt when I looked into some of the UNDO patches,
where they modify the core code to add UNDO-specific WAL types.
Different AMs may need different sets of operations to be WAL-logged,
so it may be better for the AMs to register their own types?
Regards,
Haribabu Kommi
Fujitsu Australia
On 08/04/2019 19:38, Andres Freund wrote:
> On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:
>> On 05/04/2019 23:25, Andres Freund wrote:
>>> - the (optional) bitmap heap scan API - that's fairly intrinsically
>>> block based. An AM could just internally subdivide TIDs in a different
>>> way, but I don't think a bitmap scan like we have would e.g. make a
>>> lot of sense for an index oriented table without any sort of stable
>>> tid.
>>
>> If an AM doesn't implement the bitmap heap scan API, what happens?
>> Bitmap scans are disabled?
>
> Yea, the planner doesn't consider them. It just masks the index's
> amhasgetbitmap. Seems to be the most reasonable thing to do?

Yep.

>> Even if an AM isn't block-oriented, the bitmap heap scan API still makes
>> sense as long as there's some correlation between TIDs and physical
>> location.
>
> Yea, it could be a non-linear mapping. But I'm honestly not sure how
> many non-block-oriented AMs with such a correlation there are - I mean
> you're not going to have that in, say, an IOT. And it'd be trivial to
> just "fake" a block mapping for an in-memory AM.

Now that Ashwin conveniently posted the ZedStore prototype we started to
hack on [1], I'll point to that as an example :-). It stores data in a
B-tree (or rather, multiple B-trees) on TIDs. So there's a very high
correlation between TIDs and physical locality, but it's not
block-oriented.

Another example would be the "LZ4 Compressed Storage Manager" that
Nikolai envisioned recently [2]. Before we came up with the idea of
using B-trees in ZedStore, we were actually thinking of something very
similar to that. Although that one perhaps still counts as
"block-oriented" as far as the bitmap heap scan API is concerned: it
still deals with blocks, they're just mapped to different physical
locations.

I'm not sure how an index-organized table would work, but I think it
would want to just get the whole bitmap, and figure out the best order
to fetch the rows by itself.

PS. Seems that having a table AM API has opened the floodgates for
storage ideas. Nice!

- Heikki

[1] https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net
> On Tue, Apr 9, 2019 at 4:12 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> With regards to storing the rows themselves, the second biggest
>> limitation is a limitation that is not actually a part of tableam
>> itself: WAL. Many tableams would want to use WAL, but we only have
>> extensible WAL as part of generic_xlog.h. While that's useful to allow
>> prototyping etc., it's imo not efficient enough to build a competitive
>> storage engine for OLTP (OLAP is probably much less of a problem). I
>> don't know what the best approach here is - allowing "well known"
>> extensions to register rmgr entries would be the easiest solution, but
>> it's certainly a bit crummy.
>
> I had the same doubt when I looked into some of the UNDO patches,
> where they modify the core code to add UNDO-specific WAL types.
> Different AMs may need different sets of operations to be WAL-logged,
> so it may be better for the AMs to register their own types?

I'm also curious about that. As far as I can see, the main objection
against that was that in this case the recovery process will depend on
an extension, which could violate reliability. But I wonder if this
argument is still valid for AMs, since the whole data kind of depends on
the extension, not only the recovery.

Btw, can someone elaborate why exactly generic_xlog is not efficient
enough? I've gone through the corresponding thread; it looks like
generic WAL records are bigger than normal ones - is that the only
reason?
On Tue, Apr 9, 2019 at 2:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I had the same doubt when I looked into some of the UNDO patches,
> where they modify the core code to add UNDO-specific WAL types.
> Different AMs may need different sets of operations to be WAL-logged,
> so it may be better for the AMs to register their own types?

In the current undo proposal, the undo subsystem itself needs an rmgr ID
for WAL-logging of some low-level undo log management records (ie its
own record types), but then any undo-aware AM would also need to have
its own rmgr ID for its own universe of WAL records (its own types,
meaningful to it alone), and that same rmgr ID is used also for its undo
records (which themselves have specific types). That is, in rmgrlist.h,
an undo-aware AM would register not only its redo function (called for
each WAL record in recovery) but also its undo function (called when
transactions roll back, if your transaction generated any undo records).
Which raises the question of how a hypothetical undo-aware AM could deal
with undo records if it's using generic WAL records. I haven't thought
about that.

A couple of ideas we've bounced around to allow extensions to work with
specific WAL records: (1) a community-wide registry of rmgr IDs
(basically, just allocate the IDs for all known extensions in a header
in the tree, like IANA), or (2) a per-cluster registry scheme where an
extension, identified by its name, would be permanently allocated an
rmgr number and library + callback functions for the lifetime of that
cluster. Or something like that.

--
Thomas Munro
https://enterprisedb.com
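For idea (1), the in-tree registration could look like one more line in
src/include/access/rmgrlist.h, using its existing PG_RMGR() columns
(symbol, name, redo, desc, identify, startup, cleanup, mask). The
zedstore entry and its callback names here are hypothetical:

/* sketch: a "well known" extension allocated a fixed rmgr ID in-tree */
PG_RMGR(RM_ZEDSTORE_ID, "zedstore", zedstore_redo, zedstore_desc,
        zedstore_identify, NULL, NULL, zedstore_mask)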
Hi,

On 2019-04-09 11:17:29 +0200, Dmitry Dolgov wrote:
> I'm also curious about that. As far as I can see, the main objection
> against that was that in this case the recovery process will depend on
> an extension, which could violate reliability.

I don't think that's a primary concern - although it is one. The mapping
from types of records to the handler function needs to be accessible at
a very early stage, when the cluster isn't yet in a consistent state. So
we can't just go and look into pg_am, look up a handler function, etc. -
crash recovery happens much earlier than that is possible. Nor do we
want the mapping of 'rmgr id' -> 'extension' to be defined in the config
file, that's way too likely to be wrong. So there needs to be a
different type of mapping, accessible outside the catalog. I suspect
we'd have to end up with something very roughly like the relmapper
infrastructure. A tertiary problem is then how to identify extensions in
that mapping - although I suspect just using any library name that can
be passed to load_library() will be OK.

> But I wonder if this argument is still valid for AMs, since the whole
> data kind of depends on the extension, not only the recovery.

I don't buy that argument. If you have an AM that registers, using a new
facility, replay routines, and then it errors out / crashes during
those, there's no way to get the cluster back into a consistent state.
So it's not just the one table in that AM that's gone, it's the entire
cluster that's impacted.

> Btw, can someone elaborate why exactly generic_xlog is not efficient
> enough? I've gone through the corresponding thread; it looks like
> generic WAL records are bigger than normal ones - is that the only
> reason?

That's one big reason. But also, you just can't do much more than "write
this block into that file" during recovery with it. A lot of our replay
routines intentionally do more complicated tasks.

Greetings,

Andres Freund
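For anyone following along, the generic_xlog.h pattern under discussion
looks roughly like this (a sketch with error handling omitted; "rel" and
"buf" are assumed to be a relation and an exclusively-locked buffer the
AM already holds). Replay can only restore the registered pages, which
is why the cleverer replay logic mentioned above isn't possible:

GenericXLogState *state = GenericXLogStart(rel);
Page        page = GenericXLogRegisterBuffer(state, buf, 0);

/* ... modify the working copy "page" however the AM likes ... */

GenericXLogFinish(state);   /* computes page deltas, emits the record */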
On Fri, Apr 5, 2019 at 11:25 PM Andres Freund <andres@anarazel.de> wrote:
> I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
> others that collaborated on making tableam happen. It was/is a huge
> project.

Thank you so much for bringing this project to commit! Excellent work!

Your explanation of the existing limitations looks very good and
convincing. But I think there is one you didn't mention: we require new
table AMs to basically keep the old "contract" between heap and indexes.
We have an "all or nothing" choice during updates. So, a storage can
either insert into none of the indexes or insert into all of them
(HOT-like update). I think any storage which is going to address the
"write amplification" point raised by Uber needs to break this
"contract". For example, zheap is promised to implement delete-marking
indexes. But it's not yet published, and to me it's not clear that this
approach is better than the alternatives.

With the delete-marking approach you need to update the index tuples
corresponding to old values of the updated fields. But beyond that, it's
not trivial to delete an index tuple. In order to do that, you need to
both locate this index tuple and know that this index value isn't
present in the undo chain. So purging the undo chain likely requires
another index lookup - thus, we basically need two random index lookups
for every deleted index tuple. It also becomes more complex to look up
the appropriate heap tuple during an index scan: you need to check not
only visibility, but also that the index value matches (here we'd need
to adjust the index_fetch_tuple interface), because it might happen that
the version visible to you has a different index value. That may lead to
O(N^2) performance when accessing a single row with N versions (MySQL
InnoDB has this problem).

An alternative idea is to have MVCC-aware indexes. This approach looks
more attractive to me. In this approach you basically need xmin and xmax
fields in index tuples. On insertion of an index tuple you fill its
xmin. On update, the previous index tuple is marked with xmax. After
that, outdated index tuples can be deleted lazily when page space is
required. So, only one random access is required per deleted index
tuple. With this approach fetching a single row is O(N). Also,
index-only scans become very easy and don't even need a visibility map.

The only problem here is the extra space requirement for index tuples.
But I think this is a well-isolated problem, which is easy to attack.
For instance, some visibility information could be evicted to an undo
chain (like zheap does for its tuples). Also, we could have a special
bit for "all visible" index tuples; with the "all visible" bit set, a
tuple can get rid of the visibility fields. We can do this for index
tuples because, if an index tuple requires extra space, we can split the
page - in contrast to the heap, where tuples are fixed in pages and xmax
needs to be updated in place.

I understand that delete-marking indexes have some advantages, and some
people find them more appealing. But my point is that we shouldn't build
one of these approaches into the API unless we have concrete proof that
it strongly overcomes the other. It would be better to have our table AM
API flexible enough to implement both. I can imagine we could have
proper encapsulation here, bringing more of the interaction with indexes
to the table AM side.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
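A sketch of the tuple layout implied by the MVCC-aware index idea above
(a hypothetical structure, not anything existing in PostgreSQL):

typedef struct MvccIndexTupleHeader
{
    TransactionId xmin;     /* set when the index tuple is inserted */
    TransactionId xmax;     /* set when a newer version supersedes it */
    uint16        flags;    /* e.g. a hypothetical ITUP_ALL_VISIBLE bit,
                             * allowing the visibility fields to be
                             * dropped on page split/rewrite */
    /* index key data follows */
} MvccIndexTupleHeader;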
Hi,

On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> Your explanation of the existing limitations looks very good and
> convincing. But I think there is one you didn't mention: we require
> new table AMs to basically keep the old "contract" between heap and
> indexes. We have an "all or nothing" choice during updates. So, a
> storage can either insert into none of the indexes or insert into all
> of them (HOT-like update).

I think that's a problem, and yea, I should have mentioned it. I'd
earlier thought about it and then forgot.

I however don't think we should design the interface for this before we
have at least one AM that's actually in decent-ish shape that needs it.
I seriously doubt we'll get the interface right enough.

Note: I'm *extremely* *extremely* doubtful that moving the full executor
invocations for expression indexes etc. into the tableam is a sane thing
to do. It's possible to convince me there's no alternative, but it'll be
really hard.

I suspect the right direction will be more going in a direction of
computing new index tuples for expression indexes before tableam gets
involved. If we do that right, we can also implement the stuff that
1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

> I think any storage which is going to address the "write amplification"
> point raised by Uber needs to break this "contract".

FWIW, I don't think it makes much sense to use Uber as a justification
for anything here. Their analysis was so deeply flawed and motivated by
non-technical reasons that it should just be ignored.

Greetings,

Andres Freund
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of the existing limitations looks very good and
> > convincing. But I think there is one you didn't mention: we require
> > new table AMs to basically keep the old "contract" between heap and
> > indexes. We have an "all or nothing" choice during updates. So, a
> > storage can either insert into none of the indexes or insert into
> > all of them (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs it.
> I seriously doubt we'll get the interface right enough.
>
> Note: I'm *extremely* *extremely* doubtful that moving the full executor
> invocations for expression indexes etc. into the tableam is a sane thing
> to do. It's possible to convince me there's no alternative, but it'll be
> really hard.
>
> I suspect the right direction will be more going in a direction of
> computing new index tuples for expression indexes before tableam gets
> involved. If we do that right, we can also implement the stuff that
> 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

Probably we can invent a few modes a table AM might work in: calculation
of all new index tuples; calculation of new and old index tuples for
updated fields; calculation of all new and old index tuples; and so on.
And then index tuples would be calculated either in advance or by
callback.

> > I think any storage which is going to address the "write amplification"
> > point raised by Uber needs to break this "contract".
>
> FWIW, I don't think it makes much sense to use Uber as a justification
> for anything here. Their analysis was so deeply flawed and motivated by
> non-technical reasons that it should just be ignored.

Yeah, Uber is just a buzzword here. But the problem that an update of a
single indexed field leads to insertions into every index is well-known
among PostgreSQL users.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of the existing limitations looks very good and
> > convincing. But I think there is one you didn't mention: we require
> > new table AMs to basically keep the old "contract" between heap and
> > indexes. We have an "all or nothing" choice during updates. So, a
> > storage can either insert into none of the indexes or insert into
> > all of them (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs it.
> I seriously doubt we'll get the interface right enough.

Sure. My point is that once we get the first table AM which needs this,
say zheap, we shouldn't make it like this:

TM_Result (*tuple_update) (Relation rel,
                           ...
                           bool *update_indexes,
                           bool *delete_marking);

but rather try to design a proper encapsulation of the logic inside the
table AM.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
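In other words (a purely hypothetical sketch of the contrast being
drawn): instead of the AM signalling its needs back through output
flags, the executor could hand the AM a callback that the AM invokes
only for the index entries that actually need maintenance. None of these
names exist in PostgreSQL:

/* hypothetical: invoked by the AM for each index needing an update */
typedef void (*IndexMaintenanceCallback) (Relation indexrel,
                                          ItemPointer tid,
                                          void *arg);

TM_Result   (*tuple_update) (Relation rel,
                             /* ... as today ... */
                             IndexMaintenanceCallback index_cb,
                             void *index_cb_arg);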
> On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
>
> A second set of limitations is around making more of tableam
> optional. Right now it e.g. is not possible to have an AM that doesn't
> implement insert/update/delete. Obviously an AM can just throw an error
> in the relevant callbacks, but I think it'd be better if we made those
> callbacks optional, and threw errors at parse-analysis time (both to
> make the errors consistent, and to ensure the error is consistently
> thrown, rather than only when e.g. an UPDATE actually finds a row to
> update).

Agree, but I guess some of tableam should still be mandatory, and then I
wonder where to put the line between those that are optional and those
that are not. E.g. it looks like it can be relatively straightforward
(ignoring `create table as` and some other stuff) to make
insert/update/delete optional with messages at analysis time, but for
others, like the parallel scan related ones, it's probably not.
Hi!

On 2019-04-17 22:02:24 +0200, Dmitry Dolgov wrote:
> > On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > A second set of limitations is around making more of tableam
> > optional. Right now it e.g. is not possible to have an AM that doesn't
> > implement insert/update/delete. Obviously an AM can just throw an error
> > in the relevant callbacks, but I think it'd be better if we made those
> > callbacks optional, and threw errors at parse-analysis time (both to
> > make the errors consistent, and to ensure the error is consistently
> > thrown, rather than only when e.g. an UPDATE actually finds a row to
> > update).
>
> Agree, but I guess some of tableam should still be mandatory, and then I
> wonder where to put the line between those that are optional and those
> that are not. E.g. it looks like it can be relatively straightforward
> (ignoring `create table as` and some other stuff) to make
> insert/update/delete optional with messages at analysis time, but for
> others, like the parallel scan related ones, it's probably not.

Thanks for the patch! I assume you're aware, but it's probably not going
to be applied for 12...

I think most of the read-only stuff just needs to be non-optional, and
most of the DML stuff needs to be optional.

On the executor side it'd probably be good to make the sample scan
optional too, but then we also need to check for that during
parse-analysis. In contrast to bitmap scans there's no alternative way
to execute them.

> 	Assert(routine->relation_set_new_filenode != NULL);
> diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
> index c39218f8db..36e2dbf1b8 100644
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -41,6 +41,7 @@
>  #include "miscadmin.h"
>  #include "optimizer/optimizer.h"
>  #include "nodes/makefuncs.h"
> +#include "nodes/nodeFuncs.h"
>  #include "parser/parse_coerce.h"
>  #include "parser/parse_collate.h"
>  #include "parser/parse_expr.h"
> @@ -901,6 +902,13 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
>  										   NULL, false, false);
>  	rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
>
> +	if (is_from && !table_support_multi_insert(rel))
> +		ereport(ERROR,
> +				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +				 errmsg("Table access method doesn't support the operation"),
> +				 parser_errposition(pstate,
> +									exprLocation((Node *) stmt))));

Probably should fall back to plain inserts if multi-insert isn't
supported. And if insert isn't supported either, we should probably talk
about that specifically, i.e. a message like "access method \"%s\" of
table \"%s\" does not support %s"? Without knowing at least that much
about the operation it might sometimes be very hard to figure out what's
not supported.

> +static inline bool
> +table_support_speculative(Relation rel)
> +{
> +	return rel->rd_tableam == NULL ||
> +		(rel->rd_tableam->tuple_insert_speculative != NULL &&
> +		 rel->rd_tableam->tuple_complete_speculative != NULL);
> +}

In GetTableAmRoutine() I'd assert that either both or none are defined.

> +static inline bool
> +table_support_multi_insert(Relation rel)
> +{
> +	return rel->rd_tableam == NULL ||
> +		(rel->rd_tableam->multi_insert != NULL &&
> +		 rel->rd_tableam->finish_bulk_insert != NULL);
> +}

bulk insert already is optional...

I think there's more places that need checks like these - consider e.g.
replication and such that don't go through the full blown executor.

Greetings,

Andres Freund
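For clarity, the "both or none" check suggested above could be a single
assertion in GetTableAmRoutine() (a sketch, using the callback names
from the patch):

Assert((routine->tuple_insert_speculative == NULL) ==
       (routine->tuple_complete_speculative == NULL));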
> On Wed, Apr 17, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
>
> I assume you're aware, but it's probably not going to be applied for 12...

Sure, the patch was mostly to express more clearly what I was thinking
about :)

> I think most of the read-only stuff just needs to be non-optional, and
> most of the DML stuff needs to be optional.
>
> On the executor side it'd probably be good to make the sample scan
> optional too, but then we also need to check for that during
> parse-analysis. In contrast to bitmap scans there's no alternative way
> to execute them.

Yeah, makes sense.

> bulk insert already is optional...

Oh, I hadn't noticed.
On Wed, 10 Apr 2019 at 18:14, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> An alternative idea is to have MVCC-aware indexes. This approach looks
> more attractive to me. In this approach you basically need xmin and
> xmax fields in index tuples. On insertion of an index tuple you fill
> its xmin. On update, the previous index tuple is marked with xmax.
+1
xmax can be provided to the index by the indexam when 1) we mark killed
tuples, or 2) we do an index scan of an index entry without xmax set.
xmax can be set as a hint on normal scans, or set as part of an update,
as the index chooses.
> After that, outdated index tuples can be deleted lazily when page
> space is required.
That is already done, so hardly any change there.
> Also, we could have a special bit for "all visible" index tuples; with
> the "all visible" bit set, a tuple can get rid of the visibility
> fields. We can do this for index tuples because, if an index tuple
> requires extra space, we can split the page - in contrast to the heap,
> where tuples are fixed in pages and xmax needs to be updated in place.
Keeping the xmin/xmax would also be useful for historical indexes, i.e. indexes that can be used to search for data with historic snapshots.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Apr-10, Alexander Korotkov wrote:
> An alternative idea is to have MVCC-aware indexes. This approach looks
> more attractive to me. In this approach you basically need xmin and
> xmax fields in index tuples.

"We liked freezing xmin so much that we had to introduce freezing for
xmax" -- rhaas dixit. And now we want to introduce freezing for indexes?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Apr-10, Alexander Korotkov wrote:
> > An alternative idea is to have MVCC-aware indexes. This approach looks
> > more attractive to me. In this approach you basically need xmin and
> > xmax fields in index tuples.
>
> "We liked freezing xmin so much that we had to introduce freezing for
> xmax" -- rhaas dixit. And now we want to introduce freezing for
> indexes?

Plus it would add 8 bytes to the size of every index tuple. If you are
indexing long-ish strings it may not hurt too much, but if your primary
key is an integer, I think your index is going to get a lot bigger.

The problem with freezing is perhaps avoidable if you store an epoch in
the page special space as part of all this. But I don't see any way to
avoid having the tuples get wider. Possibly you could include xmin and
xmax only when needed, removing xmin once the tuples are all-visible and
splitting pages if you need to make room to add an xmax. I'm not sure
how much that helps, though, because if you do a bulk insert, you're
going to have to leave room for all of the xmins initially, and removing
them later will produce free space for which you may have little use.

I don't think that including visibility information in indexes is a bad
idea; we've thought about making zheap do this someday. But I think that
we need to use some more sophisticated approach involving, maybe, undo
pointers, or some other kind of magic, rather than just widening the
tuples. I expect that just widening the tuples would be good enough to
win for some use cases, but I think there would be others that lose
heavily. In any case, I agree that we do not want to create more things
that need freezing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
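For illustration, the "epoch in the page special space" idea could look
something like this hypothetical opaque area (not an existing PostgreSQL
structure): with a per-page epoch, the 32-bit xmin/xmax values on the
page can be interpreted as 64-bit transaction IDs, so they never need
freezing:

typedef struct MvccIndexPageOpaqueData
{
    uint32      xid_epoch;  /* high-order 32 bits for all XIDs on page */
    uint16      flags;
} MvccIndexPageOpaqueData;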
On Tue, Jun 11, 2019 at 8:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think that including visibility information in indexes is a bad
> idea; we've thought about making zheap do this someday. But I think that
> we need to use some more sophisticated approach involving, maybe, undo
> pointers, or some other kind of magic, rather than just widening the
> tuples. I expect that just widening the tuples would be good enough to
> win for some use cases, but I think there would be others that lose
> heavily.

+1. Limited visibility information would make sense (e.g. maybe a
per-tuple all-visible bit), which would have to be backed by something
like UNDO, but storing XIDs in tuples seems like a very bad idea. The
idea that something like this would have to be usable by any possible
table access method seems unworkable in general.

Sometimes it seems like the table access method work could use some
specific non-goals. Perhaps I just haven't been paying enough attention
to have noticed them.

--
Peter Geoghegan
Hi,

On 2019-06-11 11:59:36 -0400, Robert Haas wrote:
> On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > On 2019-Apr-10, Alexander Korotkov wrote:
> > > An alternative idea is to have MVCC-aware indexes. This approach looks
> > > more attractive to me. In this approach you basically need xmin and
> > > xmax fields in index tuples.
> >
> > "We liked freezing xmin so much that we had to introduce freezing for
> > xmax" -- rhaas dixit. And now we want to introduce freezing for
> > indexes?
>
> Plus it would add 8 bytes to the size of every index tuple. If you are
> indexing long-ish strings it may not hurt too much, but if your primary
> key is an integer, I think your index is going to get a lot bigger.
>
> The problem with freezing is perhaps avoidable if you store an epoch in
> the page special space as part of all this. But I don't see any way to
> avoid having the tuples get wider. Possibly you could include xmin and
> xmax only when needed, removing xmin once the tuples are all-visible and
> splitting pages if you need to make room to add an xmax. I'm not sure
> how much that helps, though, because if you do a bulk insert, you're
> going to have to leave room for all of the xmins initially, and removing
> them later will produce free space for which you may have little use.
>
> I don't think that including visibility information in indexes is a bad
> idea; we've thought about making zheap do this someday. But I think that
> we need to use some more sophisticated approach involving, maybe, undo
> pointers, or some other kind of magic, rather than just widening the
> tuples. I expect that just widening the tuples would be good enough to
> win for some use cases, but I think there would be others that lose
> heavily.

Yea, I think there are plenty of reasons to want to do something
different than what we're doing. But I'd like to see a concrete proposal
before building an API for it...

Greetings,

Andres Freund
On Tue, Jun 11, 2019 at 12:32 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, I think there are plenty of reasons to want to do something
> different than what we're doing. But I'd like to see a concrete proposal
> before building an API for it...

I wasn't intending to propose that you should. We're just in the
brainstorming stage here, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company