Re: Table AM and DDLs

From
Andres Freund
Date:
Hi,

On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the answer and sorry about the late reply.

Mine is even later ;)


> > I don't think that's quite right. It's not exactly obvious from the
> > name, but RelationDropStorage() does not actually drop storage. Instead
> > it *schedules* the storage to be dropped upon commit.
> >
> > The reason for deferring the dropping of table storage is that DDL in
> > postgres is transactional. Therefore we cannot remove the storage at the
> > moment the DROP TABLE is executed - only when the transaction that
> > performed the DDL commits. Therefore just providing you with a callback
> > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > you'd not have a way to execute the "actual" dropping of the relation at
> > the later stage.
> >
> 
> Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> -> heap_drop_with_catalog) where the table was just scheduled for deletion,
> but it appeared like this was the place where the "actual" delete is
> performed. Looking closer, I see this was the wrong location. However, the
> intention was to get a callback when the "actual" delete should happen.
> Before that, the blocks are still potentially alive and could be read, so
> they shouldn't be recycled.
> 
> It seems the right location is in the storage manager (smgr_unlink
> in smgr.c), but that does not appear to be extensible. Are there any plans
> to make it available so that you can implement something other than just
> "magnetic disk"?

There've been patches to add new types of storage below smgr.c, but not
in a way that can be done outside of build time, as far as I recall.


> > Before I explain some more: Could you describe in a bit more detail what
> > kind of optimization you'd like to make?
> >
> 
> This is not really about any optimizations, it is more about a good API for
> tables and managing storage. If a memory table can be implemented entirely
> in an extension, with its storage fully managed by it, there is a lot of
> interesting potential for various implementations of table backends. For
> this to work I think it is necessary to be able to handle schema changes
> for the backend storage in addition to scans, inserts, updates, and
> deletes, but I am not sure if it is already possible in some way that I
> haven't discovered or if I should just try to propose something (making
> the storage manager API extensible seems like a good first attempt).

As long as you have a compatible definition of what is acceptable as an
"in place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.

Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.

If you want a different definition of what needs a rewrite, good luck,
it'll be a heck of a lot more work.



> > Due to postgres' transactional DDL you cannot really change the storage
> > layout of *existing data* when that DDL command is executed - the data
> > still needs to be interpretable in case the DDL is rolled back
> > (including when crashing).
> >
> 
> No, didn't expect this, but some means to see that a schema change is about
> to happen.

For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().


> > Other changes, e.g. changing the type of a column "sufficiently", will
> > cause a so called table rewrite. Which means that a new relation will be
> > created (including a call to relation_set_new_filenode()), then that new
> > relation will get all the new data inserted, and then
> > pg_class->relfilenode for the "original" relation will be changed to the
> > "rewritten" table (there's two variants of this, once for rewrites due
> > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).

> But that is not visible in the access method interface. If I add debug
> output to the memory table, I only see a call to needs_toast_table. If
> there were a new call to create a new block, together with some additional
> information about it, this would be possible to handle.

It should be. If I e.g. do

CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()

#0  table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c, minmulti=0x7ffc8c612638)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410, accessmtd=2,
    tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=false,
    relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3  0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0,
    queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null

then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0  table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c, minmulti=0x7ffc8c612758)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, relfilenode=3016407,
    accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=true,
    relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true, is_internal=true, relrewrite=3016404,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291



> I *was* expecting either a call of set_filenode with a new xact id, or
> something like that, and with some information so that you can locate the
> schema change planned (e.g., digging through pg_class and friends). I just
> don't see that when I add debug output.

You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)


Greetings,

Andres Freund



Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Mats Kindahl
Date:
On Mon, Mar 22, 2021 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the answer and sorry about the late reply.

Mine is even later ;)

:)

Seems I keep the tradition. :)

Thanks a lot for the pointers, I have some comments below both about DROP TABLE and ALTER TABLE.


> > I don't think that's quite right. It's not exactly obvious from the
> > name, but RelationDropStorage() does not actually drop storage. Instead
> > it *schedules* the storage to be dropped upon commit.
> >
> > The reason for deferring the dropping of table storage is that DDL in
> > postgres is transactional. Therefore we cannot remove the storage at the
> > moment the DROP TABLE is executed - only when the transaction that
> > performed the DDL commits. Therefore just providing you with a callback
> > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > you'd not have a way to execute the "actual" dropping of the relation at
> > the later stage.
> >
>
> Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> -> heap_drop_with_catalog) where the table was just scheduled for deletion,
> but it appeared like this was the place where the "actual" delete is
> performed. Looking closer, I see this was the wrong location. However, the
> intention was to get a callback when the "actual" delete should happen.
> Before that, the blocks are still potentially alive and could be read, so
> they shouldn't be recycled.
>
> It seems the right location is in the storage manager (smgr_unlink
> in smgr.c), but that does not appear to be extensible. Are there any plans
> to make it available so that you can implement something other than just
> "magnetic disk"?

There've been patches to add new types of storage below smgr.c, but not
in a way that can be done outside of build time, as far as I recall.

I have done some more research and I do not think it is necessary to extend the storage layer. As a matter of fact, I think the patch I suggested is the right approach: let me elaborate on why.

Let's look at how the implementation works with the heap access method (the file heapam_handler.c) and for this case let's use CREATE TABLE, DROP TABLE, and TRUNCATE TABLE (the last one since it is supported in the Table AM and hence is a good reference for the comparison).

Disregarding surrounding layers, we have three layers that are important here:
  1. Heap catalog layer (not sure what to call it, but it's the src/backend/catalog/heap.c file)
  2. AM layer (the src/backend/access/heap/heapam_handler.c file)
  3. Storage layer (the src/backend/catalog/storage.c file) "code to create and destroy physical storage for relations".
Looking at TRUNCATE TABLE, we have the following calls through these layers:
  1. In the heap catalog layer we have a call of heap_truncate_one_rel which calls the table AM layer.
  2. In the Table AM layer heapam_relation_nontransactional_truncate will just call the storage layer to truncate the storage.
  3. The storage layer gets called through RelationTruncate, which will truncate the actual files.
Looking at CREATE TABLE, we have a similar pattern:
  1. In the heap catalog layer heap_create_with_catalog is called, which in turn calls heap_create, which will create the actual relcache entry and also call the table AM layer if it is a relation, materialized view, or toast table.
  2. In the Table AM layer, heapam_relation_set_new_filenode is called which will record the transaction identifiers and call the storage layer to create the underlying storage.
  3. In the storage layer, RelationCreateStorage will create the necessary storage, but also register the table for deletion if the transaction is aborted.
Note here that the storage layer remembers the table for deletion by saving it in pendingDeletes, which is local to the storage layer.

Looking at DROP TABLE, we have a similar pattern, but with one step missing:
  1. In the heap catalog layer the function heap_drop_with_catalog is called, which releases the system cache and calls the storage layer to drop the relation.
  2. In the storage layer, the function RelationDropStorage is called, which will record the table to be dropped in pendingDeletes.
When committing (or aborting) the transaction, there are two calls that are interesting, in this order:
  1. CallXactCallbacks which calls registered callbacks
  2. smgrDoPendingDeletes, which calls the storage layer directly to perform the actual deletion, if necessary.
Now, suppose that we want to replace the storage layer with a different one. It is straightforward to replace it by implementing the Table AM methods above, but we are missing a callback on dropping the table. If we have that, we can record the table-to-be-dropped in a similar manner to how the heap AM does it and register a transaction callback using RegisterXactCallback.



> > Before I explain some more: Could you describe in a bit more detail what
> > kind of optimization you'd like to make?
> >
>
> This is not really about any optimizations, it is more about a good API for
> tables and managing storage. If a memory table can be implemented entirely
> in an extension, with its storage fully managed by it, there is a lot of
> interesting potential for various implementations of table backends. For
> this to work I think it is necessary to be able to handle schema changes
> for the backend storage in addition to scans, inserts, updates, and
> deletes, but I am not sure if it is already possible in some way that I
> haven't discovered or if I should just try to propose something (making
> the storage manager API extensible seems like a good first attempt).

As long as you have a compatible definition of what is acceptable as an
"in place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.

Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.

If you want a different definition of what needs a rewrite, good luck,
it'll be a heck of a lot more work.

No, this should work fine.

> > Due to postgres' transactional DDL you cannot really change the storage
> > layout of *existing data* when that DDL command is executed - the data
> > still needs to be interpretable in case the DDL is rolled back
> > (including when crashing).
> >
>
> No, didn't expect this, but some means to see that a schema change is about
> to happen.

For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().


> > Other changes, e.g. changing the type of a column "sufficiently", will
> > cause a so called table rewrite. Which means that a new relation will be
> > created (including a call to relation_set_new_filenode()), then that new
> > relation will get all the new data inserted, and then
> > pg_class->relfilenode for the "original" relation will be changed to the
> > "rewritten" table (there's two variants of this, once for rewrites due
> > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).

> But that is not visible in the access method interface. If I add debug
> output to the memory table, I only see a call to needs_toast_table. If
> there were a new call to create a new block, together with some additional
> information about it, this would be possible to handle.

It should be. If I e.g. do

CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()

#0  table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c, minmulti=0x7ffc8c612638)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410, accessmtd=2,
    tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=false,
    relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3  0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0,
    queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null

then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0  table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c, minmulti=0x7ffc8c612758)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, relfilenode=3016407,
    accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=true,
    relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true, is_internal=true, relrewrite=3016404,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291



> I *was* expecting either a call of set_filenode with a new xact id, or
> something like that, and with some information so that you can locate the
> schema change planned (e.g., digging through pg_class and friends). I just
> don't see that when I add debug output.

You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)

I didn't get a callback because I did ADD COLUMN and that works differently: it does not call set_filenode until you either try to insert something or run a vacuum. Thanks for the pointers, it helped a lot. I need to look over the code a little more.

Thanks,
Mats Kindahl



Greetings,

Andres Freund

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Alexander Kuzmenkov
Date:
On 05.04.2021 22:57, Mats Kindahl wrote:
Now, suppose that we want to replace the storage layer with a different one. It is straightforward to replace it by implementing the Table AM methods above, but we are missing a callback on dropping the table. If we have that, we can record the table-to-be-dropped in a similar manner to how the heap AM does it and register a transaction callback using RegisterXactCallback.

This explanation makes sense, and the suggested patch makes it easier to replace the storage layer with a different one.

Some other places might become problematic if we're trying to implement fully memory-based tables. For example, the heap_create_with_catalog -> GetNewRelFilenode -> access() call that directly checks the existence of a file bypassing the smgr layer. But I think that adding a symmetric callback to the tableam layer can be a good start for further experiments.

Some nitpicks:

+	/*
+	 * This callback needs to remove all associations with the relation `rel`
+	 * since the relation is being dropped.
+	 *
+	 * See also table_relation_reset_filenode().
+	 */

"Remove all associations" sounds vague, maybe something like "schedule the relation files to be deleted at transaction commit"?


+	void        (*relation_reset_filenode) (Relation rel);

This line uses spaces instead of tabs.


For the reference, there is a recent patch that makes the smgr layer itself pluggable: https://www.postgresql.org/message-id/flat/1dc080496f58ce5375778baed0c0fbcc%40postgrespro.ru#502a1278ad8fce6ae85c08b4806c2289


--
Alexander Kuzmenkov
https://www.timescale.com/

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Aleksander Alekseev
Date:
Hi hackers,

> As a matter of fact, I think the patch I suggested is the right approach:
> let me elaborate on why.
> [...]
> It is straightforward to replace it by implementing the Table AM methods
> above, but we are missing a callback on dropping the table. If we have that,
> we can record the table-to-be-dropped in a similar manner to how the heap AM
> does it and register a transaction callback using RegisterXactCallback.

Since no one objected in 5 months, I assume Mats made a good point. At least,
personally, I can't argue.

The patch looks good to me except for the fact that comments seem to be
inaccurate in light of the discussion. The corrected patch is attached.
I'm going to mark it as "Ready for Committer" unless anyone objects.

-- 
Best regards,
Aleksander Alekseev


Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Aleksander Alekseev
Date:
Hi hackers,

> I'm going to mark it as "Ready for Committer" unless anyone objects.

I updated the status of the patch.

To clarify, Alexander and I replied almost at the same time. The
drawbacks noted by Alexander are fixed in the v2 version of the patch.

-- 
Best regards,
Aleksander Alekseev



Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Heikki Linnakangas
Date:
On 27/09/2021 14:59, Aleksander Alekseev wrote:
> Hi hackers,
> 
>> As a matter of fact, I think the patch I suggested is the right approach:
>> let me elaborate on why.
>> [...]
>> It is straightforward to replace it by implementing the Table AM methods
>> above, but we are missing a callback on dropping the table. If we have that,
>> we can record the table-to-be-dropped in a similar manner to how the heap AM
>> does it and register a transaction callback using RegisterXactCallback.
> 
> Since no one objected in 5 months, I assume Mats made a good point. At least,
> personally, I can't argue.

I agree that having a table AM callback at relation drop would make it 
more consistent with creating and truncating a relation. Then again, the 
indexam API doesn't have a drop-callback either.

But what can you actually do in the callback? WAL replay of dropping the 
storage needs to work without running any AM-specific code. It happens 
as part of replaying a commit record. So whatever action you do in the 
callback will not be executed at WAL replay. Also, because the callback 
merely *schedules* things to happen at commit, it cannot generate 
separate WAL records about dropping resources either.

Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, you'd use the callback to schedule deletion of the shared memory
block and an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.

I think you could actually piggyback on smgr's pending-deletions 
mechanism instead of re-inventing it. In the callback, you can call 
smgrGetPendingDeletes(), and drop the shared memory segment for any 
relation in that list.

- Heikki



Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Mats Kindahl
Date:
On Wed, Feb 16, 2022 at 10:07 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/09/2021 14:59, Aleksander Alekseev wrote:
> Hi hackers,
>
>> As a matter of fact, I think the patch I suggested is the right approach:
>> let me elaborate on why.
>> [...]
>> It is straightforward to replace it by implementing the Table AM methods
>> above, but we are missing a callback on dropping the table. If we have that,
>> we can record the table-to-be-dropped in a similar manner to how the heap AM
>> does it and register a transaction callback using RegisterXactCallback.
>
> Since no one objected in 5 months, I assume Mats made a good point. At least,
> personally, I can't argue.

I agree that having a table AM callback at relation drop would make it
more consistent with creating and truncating a relation. Then again, the
indexam API doesn't have a drop-callback either.

That is actually a good point. We could add an on-drop-callback for the indexam as well, if we add it for tableam. Haven't looked at that though, so if you think it should be added, I can investigate.
 
But what can you actually do in the callback? WAL replay of dropping the
storage needs to work without running any AM-specific code. It happens
as part of replaying a commit record. So whatever action you do in the
callback will not be executed at WAL replay.
Also, because the callback
merely *schedules* things to happen at commit, it cannot generate
separate WAL records about dropping resources either.

Digressing slightly: This is actually a drawback and I have been looking for a way to do things on recovery.

Just to have an example: if you have an in-memory table that is distributed, the problem is that even though the table is empty after a restart, it might actually be of interest to fetch the contents from another node on recovery. This is currently not possible, since it is assumed that all actions are present in the WAL, and a memory table would not use the WAL for this: one of the goals is to make it fast and avoid writes to disk.

It might be possible to piggyback this on the first select or insert done on the table, but that makes the system more complicated, since it is easy to miss one place where you need to do this fetching. If you always do it on recovery, there is a single place that you need to add, and the system will be in a predictable state after that.

In addition, if it were added to the first access of the table, it would add execution time to that first operation, while most users would assume that all such work is done at recovery and that the database is "warm" after recovery.

However, this is slightly outside the discussion for this proposed change, so we can ignore it for now.
 

Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, you'd use the callback to schedule deletion of the shared memory
block and an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.

I think you could actually piggyback on smgr's pending-deletions
mechanism instead of re-inventing it. In the callback, you can call
smgrGetPendingDeletes(), and drop the shared memory segment for any
relation in that list.

Hmm... it is a good point that smgrGetPendingDeletes() can be used in the commit callback for something as simple as a memory table when all the data is local. It should also work well with a distributed optimistic storage engine when you certify the transaction at commit time. What will happen then is that the actual "drop command" will be sent out at commit time rather than when the command is actually executed.

My main interest is, however, to have an API that works for all kinds of storage engines: not limited to local storage, but also supporting distributed storage systems and able to interact with existing implementations. There are a few reasons why getting a notification when the table is dropped, rather than when the commit is done, is beneficial.
  1. In a distributed storage engine you might want to distribute changes speculatively when they happen so that the commit, once it occurs, will be fast. By sending out the action early, you allow work to start independently of the current machine, which will improve parallelization.
  2. In a distributed storage engine, the order of the statements received remotely makes a difference. For example, if you want to use a distributed locking scheme for your distributed storage, you are currently forced to implement an optimistic scheme, while in reality you might want to distribute the drop and lock the table exclusively on all remote nodes (this is already what PostgreSQL does, locking the table on a drop). I do realize that distributed transactions are not that simple and there are other problems associated with this, but this would still introduce an unnecessary restriction on what you can do.
  3. A problem with optimistic protocols in general is that their performance drops when you have a lot of writes. It is simply the case that other, smaller transactions will constantly force a long-running transaction to be aborted. This also means that there is a risk that a transaction that drops a table will have to be aborted out of necessity, since other transactions are updating the table. In a distributed system there will be more of those, so the odds of aborting "more important" transactions (in the sense of needing stronger locks) are higher.
  4. A smaller issue is that right now the storage manager (smgr) and the transaction system are quite tightly coupled, which makes it more difficult to make the storage system "pluggable". I think that not requiring the use of pendingDeletes would move one step in the direction of removing this coupling, but I am not entirely sure here.
It is likely that many of these problems can be worked around by placing restrictions on how DDLs can be used in transactions, but that would create unnecessary restrictions for the end user. It might also be possible to find implementation workarounds by placing code at strategic points in the implementation, but this is error-prone and the risk of making a mistake is higher.

Best wishes,
Mats Kindahl
 

- Heikki

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Andres Freund
Date:
Hi,

On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
>    2. In the storage layer, the function RelationDropStorage is called,
>    which will record the table to be dropped in the pendingDeletes
> 
> When committing (or aborting) the transaction, there are two calls that are
> interesting, in this order:
> 
>    1. CallXactCallbacks which calls registered callbacks
>    2. smgrDoPendingDeletes, which calls the storage layer directly to
>    perform the actual deletion, if necessary.
> 
> Now, suppose that we want to replace the storage layer with a different
> one. It is straightforward to replace it by implementing the Table AM
> methods above, but we are missing a callback on dropping the table. If we
> have that, we can record the table-to-be-dropped in a similar manner to how
> the heap AM does it and register a transaction callback using
> RegisterXactCallback.

I don't think implementing dropping relation data at-commit/rollback using
xact callbacks can be correct. The dropping needs to be integrated with the
commit / abort records, so it is redone during crash recovery - that's not
possible with xact callbacks.

To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.

Greetings,

Andres Freund



Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Andres Freund
Date:
Hi,

The CF entry for this patch doesn't currently apply and there has been a bunch
of feedback on the approach. Mats, are you actually waiting for further
feedback right now?

Greetings,

Andres Freund



Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Michael Paquier
Date:
On Sun, Oct 02, 2022 at 09:53:01AM -0700, Andres Freund wrote:
> The CF entry for this patch doesn't currently apply and there has been a bunch
> of feedback on the approach. Mats, are you actually waiting for further
> feedback right now?

Okay, for now this has been marked as RwF.
--
Michael

Attachment

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Mats Kindahl
Date:
Hello all,

I think the discussion went a little sideways, so let me recap what I'm suggesting:
  1. I mentioned that there is a missing callback when the filenode is unlinked, and this is particularly evident when dropping a table.
  2. It was correctly pointed out to me that an implementor needs to ensure that dropping a table is transactional.
  3. I argued that the callback is still correct and outlined how this can be handled by a table access method using xact callbacks, if necessary.
I see huge potential in the table access method API and would like to do my part in helping it succeed. I noted that the API is biased toward how the existing heap implementation works and is also very focused on implementations of "local" storage engines. To support more novel architectures (for example, various sorts of distributed architectures) and to be easier to work with, I think the API can be improved in a few places. This is a first step toward making the API both easier to use and able to support more novel use cases.

Making implementations easy to write is mostly a matter of having the right callbacks in the right places and providing the right information, but in some cases it is simply not possible to implement efficient functionality with the current interface. It can be useful to separate these two kinds of enhancements when discussing the API, but both are important if table access methods are to be practically usable and to leverage the full power of the concept.

On Tue, Aug 2, 2022 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
>    2. In the storage layer, the function RelationDropStorage is called,
>    which will record the table to be dropped in the pendingDeletes
>
> When committing (or aborting) the transaction, there are two calls that are
> interesting, in this order:
>
>    1. CallXactCallbacks which calls registered callbacks
>    2. smgrDoPendingDeletes, which calls the storage layer directly to
>    perform the actual deletion, if necessary.
>
> Now, suppose that we want to replace the storage layer with a different
> one. It is straightforward to replace it by implementing the Table AM
> methods above, but we are missing a callback on dropping the table. If we
> have that, we can record the table-to-be-dropped in a similar manner to how
> the heap AM does it and register a transaction callback using
> RegisterXactCallback.

I don't think implementing dropping relation data at-commit/rollback using
xact callbacks can be correct. The dropping needs to be integrated with the
commit / abort records, so it is redone during crash recovery - that's not
possible with xact callbacks.
 
Yes, but this patch is about making the extension aware that a file node is being unlinked.


To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.

This is not really a "drop table" callback; dropping a table is just the most obvious case where the callback is missing. So, to recap the situation as it stands right now.

Here is (transactional) truncate table:
  1. Allocate a new file node in the same tablespace as the table
  2. Add the file node to the list of pending nodes to delete
  3. Overwrite the existing file node in the relation with the new one
  4. Call table_relation_set_new_filenode to tell the extension that there is a new filenode for the relation
Here is drop table:
  1. Add the existing file node to the list of pending deletes
  2. Remove the table from the catalogs
For an extension writer, the disappearance of the old file node is "invisible" since there is no callback for it, but it is very clear when a new file node is allocated. Besides being inconsistent, this places an extra burden on the extension writer. To notice that a file node has been unlinked, you can register a transaction handler and inspect the pending list at commit or abort time. Though possible, this has two problems: 1) the table access method is notified "late" in the transaction that the file node is going away, and 2) it is unnecessarily complicated to register a transaction handler only for this inspection.

Telling the access method that the filenode is unlinked through a new callback is by far the best solution, since it does not affect existing extensions and gives table access methods the opportunity to act immediately.

I have attached an updated patch that changes the names of the callbacks since there was a name change. I had also missed the case of unlinking a file node when tables were truncated, so I have added a callback for this as well.

Best wishes,
Mats Kindahl


Greetings,

Andres Freund
Attachment

Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From
Andres Freund
Date:
Hi,

On 2022-11-16 14:49:59 +0100, Mats Kindahl wrote:
> I think the discussion went a little sideways, so let me recap what I'm
> suggesting:
> 
>    1. I mentioned that there is a missing callback when the filenode is
>    unlinked and this is particularly evident when dropping a table.
>    2. It was correctly pointed out to me that an implementor needs to ensure
>    that dropping a table is transactional.
>    3. I argued that the callback is still correct and outlined how this can
>    be handled by a table access method using xact callbacks, if necessary.

I still think 3) isn't a solution to 2). The main issue is that xact callbacks
don't address that storage has to be removed during redo as well. For that
dropping storage has to be integrated with commit / abort records.

I don't think a custom WAL rmgr addresses this either - it has to be
integrated with the commit record, and the record has to be replayed as a whole.


> I see huge potential in the table access method and would like to do my
> part in helping it in succeeding. I noted that the API is biased in how the
> existing heap implementation works and is also very focused on
> implementations of "local" storage engines. For more novel architectures
> (for example, various sorts of distributed architectures) and to be easier
> to work with, I think that the API can be improved in a few places. This is
> a first step in the direction of making the API both easier to use as well
> as enabling more novel use-cases.

I agree with that - there's lots more work to be done and the evolution from
not having the abstraction clearly shows. Some of the deficiencies are easy to
fix, but others are there because there's no quick solution to them.


> > To me it still seems fundamentally the wrong direction to implement a "drop
> > relation callback" tableam callback.
> >
> 
> This is not really a "drop table" callback, it is just the most obvious
> case where this is missing. So, just to recap the situation as it looks
> right now.
> 
> Here is (transactional) truncate table:
> 
>    1. Allocate a new file node in the same tablespace as the table
>    2. Add the file node to the list of pending nodes to delete
>    3. Overwrite the existing file node in the relation with the new one
>    4. Call table_relation_set_new_filenode to tell the extension that there is
>    a new filenode for the relation
> 
> Here is drop table:
> 
>    1. Add the existing file node to the list of pending deletes
>    2. Remove the table from the catalogs
> 
> For an extension writer, the disappearance of the old file node is
> "invisible" since there is no callback about this, but it is very clear
> when a new file node is allocated. In addition to being inconsistent, it
> adds an extra burden on the extension writer. To notice that a file node
> has been unlinked you can register a transaction handler and investigate
> the pending list at commit or abort time. Even though possible, there are
> two problems with this: 1) the table access method is notified "late" in
> the transaction that the file node is going away, and 2) it is
> unnecessarily complicated to register a transaction handler only for
> inspecting this.

Using a transaction callback alone also doesn't address the redo issue...


I think to make this viable for filenodes that don't look like md.c's, you'd
have to add a way to make commit/abort records extensible by custom
rmgrs. Then the replay of a commit/abort could call the custom rmgr for the
"sub-record" during replay.


> Telling the access method that the filenode is unlinked by adding a
> callback is by far the best solution since it does not affect existing
> extensions and will give the table access methods opportunities to act on
> it immediately.

I'm loath to add a callback that I don't think can be used correctly without
further changes.

Greetings,

Andres Freund