Thread: Re: Table AM and DDLs
Hi,
On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the answer and sorry about the late reply.
Mine is even later ;)
> > I don't think that's quite right. It's not exactly obvious from the
> > name, but RelationDropStorage() does not actually drop storage. Instead
> > it *schedules* the storage to be dropped upon commit.
> >
> > The reason for deferring the dropping of table storage is that DDL in
> > postgres is transactional. Therefore we cannot remove the storage at the
> > moment the DROP TABLE is executed - only when the transaction that
> > performed the DDL commits. Therefore just providing you with a callback
> > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > you'd not have a way to execute the "actual" dropping of the relation at
> > the later stage.
> >
>
> Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> -> heap_drop_with_catalog) where the delete was just scheduled for deletion
> but it appeared like this was the place to actually perform the "actual"
> delete. Looking closer, I see this was the wrong location. However, the
> intention was to get a callback when the "actual" delete should happen.
> Before that, the blocks are still potentially alive and could be read, so
> shouldn't be recycled.
>
> It seems the right location seems to be in the storage manager (smgr_unlink
> in smgr.c), but that does not seem to be extensible, or are there any plans
> to make it available so that you can implement something other than just
> "magnetic disk"?
There've been patches to add new types of storage below smgr.c, but not
in a way that can be done outside of build time, as far as I recall.
- Heap catalog layer (not sure what to call it, but it's the src/backend/catalog/heap.c file)
- AM layer (the src/backend/access/heap/heapam_handler.c file)
- Storage layer (the src/backend/catalog/storage.c file) "code to create and destroy physical storage for relations".
- In the heap catalog layer we have a call of heap_truncate_one_rel which calls the table AM layer.
- In the Table AM layer heapam_relation_nontransactional_truncate will just call the storage layer to truncate the storage.
- The storage layer gets called through RelationTruncate, which will truncate the actual files.
- In the heap catalog layer heap_create_with_catalog is called, which in turn calls heap_create, which will create the actual relcache and also call the table AM layer if it is a relation, materialized view, or toastvalue.
- In the Table AM layer, heapam_relation_set_new_filenode is called which will record the transaction identifiers and call the storage layer to create the underlying storage.
- In the storage layer, RelationCreateStorage will create the necessary storage, but also register the table for deletion if the transaction is aborted.
- In the heap catalog layer the function heap_drop_with_catalog is called, which releases the system cache and calls the storage layer to drop the relation
- In the storage layer, the function RelationDropStorage is called, which will record the table to be dropped in the pendingDeletes
When committing (or aborting) the transaction, there are two calls that are interesting, in this order:
- CallXactCallbacks, which calls registered callbacks
- smgrDoPendingDeletes, which calls the storage layer directly to perform the actual deletion, if necessary.
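The scheduling in those last two steps can be mocked up as standalone C (all names here are invented for illustration, not the actual backend code): the drop path only records an intent, and the unlink happens at commit, or is simply forgotten at abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for an entry in smgr's pendingDeletes list. */
typedef struct PendingDelete
{
    unsigned    relnode;            /* stand-in for a RelFileNode */
    struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_head = NULL;
static int  unlinked_count = 0;     /* how many relations were actually unlinked */

/* Called from the DROP TABLE path: only records the intent. */
static void
schedule_unlink(unsigned relnode)
{
    PendingDelete *pd = malloc(sizeof(PendingDelete));

    pd->relnode = relnode;
    pd->next = pending_head;
    pending_head = pd;
}

/* Called at end of transaction: unlink on commit, forget on abort. */
static void
do_pending_deletes(bool is_commit)
{
    while (pending_head != NULL)
    {
        PendingDelete *pd = pending_head;

        pending_head = pd->next;
        if (is_commit)
            unlinked_count++;       /* here the real unlink would happen */
        free(pd);
    }
}
```

This is why a callback at DROP TABLE time cannot remove storage directly: nothing is unlinked until do_pending_deletes(true) runs at commit.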
> > Before I explain some more: Could you describe in a bit more detail what
> > kind of optimization you'd like to make?
> >
>
> This is not really about any optimizations, it more about a good API for
> tables and managing storage. If a memory table can be implemented entirely
> in the extension and storage managed fully, there is a lot of interesting
> potential for various implementations of table backends. For this to work I
> think it is necessary to be able to handle schema changes for the backend
> storage in addition to scans, inserts, updates, and deletes, but I am not
> sure if it is already possible in some way that I haven't discovered or if
> I should just try to propose something (making the storage manager API
> extensible seems like a good first attempt).
As long as you have a compatible definition of what is acceptable "in
place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.
Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.
If you want a different definition on what needs a rewrite, good luck,
it'll be a heck of a lot more work.
> > Due to postgres' transactional DDL you cannot really change the storage
> > layout of *existing data* when that DDL command is executed - the data
> > still needs to be interpretable in case the DDL is rolled back
> > (including when crashing).
> >
>
> No, didn't expect this, but some means to see that a schema change is about
> to happen.
For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().
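That swap can be mocked up in a few lines of standalone C (RelStub and the OIDs below are invented for illustration; the real finish_heap_swap() swaps far more state):

```c
#include <assert.h>

/*
 * Illustrative stand-in for a pg_class entry: the OID names the relation,
 * the relfilenode names its on-disk data.
 */
typedef struct RelStub
{
    unsigned    oid;            /* stable identity of the relation */
    unsigned    relfilenode;    /* identity of the data files */
} RelStub;

/*
 * After a rewrite, the original relation keeps its OID but takes over the
 * transient relation's freshly filled relfilenode; the old files go away
 * when the transient relation is dropped.
 */
static void
swap_relfilenodes(RelStub *orig, RelStub *transient)
{
    unsigned    tmp = orig->relfilenode;

    orig->relfilenode = transient->relfilenode;
    transient->relfilenode = tmp;
}
```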
> > Other changes, e.g. changing the type of a column "sufficiently", will
> > cause a so called table rewrite. Which means that a new relation will be
> > created (including a call to relation_set_new_filenode()), then that new
> > relation will get all the new data inserted, and then
> > pg_class->relfilenode for the "original" relation will be changed to the
> > "rewritten" table (there's two variants of this, once for rewrites due
> > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
> But that is not visible in the access method interface. If I add debug
> output to the memory table, I only see a call to needs_toast_table. If
> there were a new call to create a new block and some additional information
> about , this would be possible to handle.
It should be. If I e.g. do
CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()
#0 table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c, minmulti=0x7ffc8c612638)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410, accessmtd=2,
tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=false,
relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0,
typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3 0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0,
queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null
then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0 table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c, minmulti=0x7ffc8c612758)
at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, relfilenode=3016407,
accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=true,
relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, reltypeid=0,
reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true, is_internal=true, relrewrite=3016404,
typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
> I *was* expecting either a call of set_filenode with a new xact id, or
> something like that, and with some information so that you can locate the
> schema change planned (e.g., digging through pg_class and friends), I just
> don't see that when I add debug output.
You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)
Greetings,
Andres Freund
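The schema inspection Andres describes (rel->rd_att->natts, attrs[0].atttypid) can be mocked up with stand-in types; AttrStub and TupleDescStub below are invented for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

#define INT8OID 20              /* pg_type OID of int8 */

/* Illustrative stand-ins for Form_pg_attribute / TupleDesc. */
typedef struct AttrStub
{
    unsigned    atttypid;
} AttrStub;

typedef struct TupleDescStub
{
    int         natts;
    AttrStub    attrs[8];
} TupleDescStub;

/*
 * What an AM's set_new_filenode callback could check: the descriptor it is
 * handed already reflects the post-ALTER schema.
 */
static bool
first_column_is_int8(const TupleDescStub *td)
{
    return td->natts >= 1 && td->attrs[0].atttypid == INT8OID;
}
```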
Now, suppose that we want to replace the storage layer with a different one. It is straightforward to replace it by implementing the Table AM methods above, but we are missing a callback on dropping the table. If we have that, we can record the table-to-be-dropped in a similar manner to how the heap AM does it and register a transaction callback using RegisterXactCallback.
This explanation makes sense, and the suggested patch makes it easier to replace the storage layer with a different one.
Some other places might become problematic if we're trying to implement fully memory-based tables. For example, the heap_create_with_catalog -> GetNewRelFilenode -> access() call that directly checks the existence of a file bypassing the smgr layer. But I think that adding a symmetric callback to the tableam layer can be a good start for further experiments.
Some nitpicks:
+	/*
+	 * This callback needs to remove all associations with the relation `rel`
+	 * since the relation is being dropped.
+	 *
+	 * See also table_relation_reset_filenode().
+	 */
"Remove all associations" sounds vague, maybe something like "schedule the relation files to be deleted at transaction commit"?
+ void (*relation_reset_filenode) (Relation rel);
This line uses spaces instead of tabs.
For reference, there is a recent patch that makes the smgr layer itself pluggable: https://www.postgresql.org/message-id/flat/1dc080496f58ce5375778baed0c0fbcc%40postgrespro.ru#502a1278ad8fce6ae85c08b4806c2289
--
Alexander Kuzmenkov
https://www.timescale.com/
Hi hackers,

> As a matter of fact, I think the patch I suggested is the right approach:
> let me elaborate on why.
> [...]
> It is straightforward to replace it by implementing the Table AM methods
> above, but we are missing a callback on dropping the table. If we have that,
> we can record the table-to-be-dropped in a similar manner to how the heap AM
> does it and register a transaction callback using RegisterXactCallback.

Since no one objected in 5 months, I assume Mats made a good point. At least,
personally, I can't argue.

The patch looks good to me except for the fact that comments seem to be
inaccurate in light of the discussion. The corrected patch is attached.

I'm going to mark it as "Ready for Committer" unless anyone objects.

--
Best regards,
Aleksander Alekseev
Hi hackers,

> I'm going to mark it as "Ready for Committer" unless anyone objects.

I updated the status of the patch. To clarify, Alexander and I replied almost
at the same time. The drawbacks noted by Alexander are fixed in the v2 version
of the patch.

--
Best regards,
Aleksander Alekseev
On 27/09/2021 14:59, Aleksander Alekseev wrote:
> Hi hackers,
>
>> As a matter of fact, I think the patch I suggested is the right approach:
>> let me elaborate on why.
>> [...]
>> It is straightforward to replace it by implementing the Table AM methods
>> above, but we are missing a callback on dropping the table. If we have that,
>> we can record the table-to-be-dropped in a similar manner to how the heap AM
>> does it and register a transaction callback using RegisterXactCallback.
>
> Since no one objected in 5 months, I assume Mats made a good point. At least,
> personally, I can't argue.
I agree that having a table AM callback at relation drop would make it
more consistent with creating and truncating a relation. Then again, the
indexam API doesn't have a drop-callback either.
But what can you actually do in the callback? WAL replay of dropping the
storage needs to work without running any AM-specific code. It happens
as part of replaying a commit record. So whatever action you do in the
callback will not be executed at WAL replay.
Also, because the callback
merely *schedules* things to happen at commit, it cannot generate
separate WAL records about dropping resources either.
Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, you'd use the callback to schedule deletion of the shared memory
block and an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.
I think you could actually piggyback on smgr's pending-deletions
mechanism instead of re-inventing it. In the callback, you can call
smgrGetPendingDeletes(), and drop the shared memory segment for any
relation in that list.
- Heikki

- In a distributed storage engine you might want to distribute changes speculatively when they happen so that the commit, once it occurs, will be fast. By sending out the action early, you allow work to start independently of the current machine, which will improve parallelization.
- In a distributed storage engine, the order of the statements received remotely makes a difference. For example, if you want to use a distributed locking scheme for your distributed storage, you are currently forced to implement an optimistic scheme, while in reality you might want to distribute the drop and lock the table exclusively on all remote nodes (this is already what PostgreSQL does, locking the table on a drop). I do realize that distributed transactions are not that simple and there are other problems associated with this, but this would still introduce an unnecessary restriction on what you can do.
- A problem with optimistic protocols in general is that they drop in performance when you have a lot of writes. It is simply the case that other smaller transactions will constantly force a long-running transaction to be aborted. This also means that there is a risk that a transaction that drops a table will have to be aborted out of necessity since other transactions are updating the table. In a distributed system there will be more of those, so the odds of aborting "more important" transactions (in the sense of needing stronger locks) are higher.
- A smaller issue is that right now the storage manager (smgr) and the transaction system are quite tightly coupled, which makes it more difficult to make the storage system "pluggable". I think that not requiring the use of pendingDeletes would move one step in the direction of removing this coupling, but I am not entirely sure here.
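Heikki's smgrGetPendingDeletes() suggestion could be mocked up roughly like this in standalone C — get_pending_deletes() and segments[] are invented stand-ins for illustration, not backend APIs:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Mock of an in-memory AM piggybacking on smgr's pending-deletes list:
 * instead of keeping its own list, the AM's end-of-xact callback asks the
 * (mocked) smgr which relations are scheduled for deletion and drops its
 * own per-relation memory for them.
 */
#define MAX_RELS 8

static void *segments[MAX_RELS];    /* per-relation memory, by mock relnode */

/* Mock of smgr's pending-deletes list, filled elsewhere during the xact. */
static unsigned pending[MAX_RELS];
static int  npending = 0;

/* Stand-in for smgrGetPendingDeletes(). */
static int
get_pending_deletes(unsigned **nodes)
{
    *nodes = pending;
    return npending;
}

/* End-of-xact callback: drop our segment for every relation smgr drops. */
static void
memory_am_at_commit(void)
{
    unsigned   *nodes;
    int         n = get_pending_deletes(&nodes);

    for (int i = 0; i < n; i++)
    {
        free(segments[nodes[i]]);
        segments[nodes[i]] = NULL;
    }
    npending = 0;
}
```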
Hi,

The CF entry for this patch doesn't currently apply and there has been a bunch
of feedback on the approach. Mats, are you actually waiting for further
feedback right now?

Greetings,

Andres Freund
On Sun, Oct 02, 2022 at 09:53:01AM -0700, Andres Freund wrote:
> The CF entry for this patch doesn't currently apply and there has been a bunch
> of feedback on the approach. Mats, are you actually waiting for further
> feedback right now?

Okay, for now this has been marked as RwF.
--
Michael
- I mentioned that there is a missing callback when the filenode is unlinked and this is particularly evident when dropping a table.
- It was correctly pointed out to me that an implementor needs to ensure that dropping a table is transactional.
- I argued that the callback is still correct and outlined how this can be handled by a table access method using xact callbacks, if necessary.
Hi,
On 2021-04-05 21:57:12 +0200, Mats Kindahl wrote:
> 2. In the storage layer, the function RelationDropStorage is called,
> which will record the table to be dropped in the pendingDeletes
>
> When committing (or aborting) the transaction, there are two calls that are
> interesting, in this order:
>
> 1. CallXactCallbacks which calls registered callbacks
> 2. smgrDoPendingDeletes, which calls the storage layer directly to
> perform the actual deletion, if necessary.
>
> Now, suppose that we want to replace the storage layer with a different
> one. It is straightforward to replace it by implementing the Table AM
> methods above, but we are missing a callback on dropping the table. If we
> have that, we can record the table-to-be-dropped in a similar manner to how
> the heap AM does it and register a transaction callback using
> RegisterXactCallback.
I don't think implementing dropping relation data at-commit/rollback using
xact callbacks can be correct. The dropping needs to be integrated with the
commit / abort records, so it is redone during crash recovery - that's not
possible with xact callbacks.
To me it still seems fundamentally the wrong direction to implement a "drop
relation callback" tableam callback.
Greetings,
Andres Freund
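Andres's redo argument can be illustrated with a standalone mock (CommitRecordStub is invented for illustration): only what travels inside the commit record itself is available to crash recovery, so that is where the drops have to live — an in-memory xact callback leaves no trace in the WAL.

```c
#include <assert.h>

/*
 * Illustrative mock: a commit record carrying the relfilenodes to unlink,
 * so that replaying the record during recovery redoes the drops.
 */
typedef struct CommitRecordStub
{
    int         ndrops;
    unsigned    drops[8];
} CommitRecordStub;

static int  replay_unlinked = 0;

/* Redo routine: recovery sees nothing except the record contents. */
static void
redo_commit(const CommitRecordStub *rec)
{
    for (int i = 0; i < rec->ndrops; i++)
        replay_unlinked++;      /* here recovery would unlink rec->drops[i] */
}
```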
Attachment
Hi,

On 2022-11-16 14:49:59 +0100, Mats Kindahl wrote:
> I think the discussion went a little sideways, so let me recap what I'm
> suggesting:
>
> 1. I mentioned that there is a missing callback when the filenode is
> unlinked and this is particularly evident when dropping a table.
> 2. It was correctly pointed out to me that an implementor needs to ensure
> that dropping a table is transactional.
> 3. I argued that the callback is still correct and outlined how this can
> be handled by a table access method using xact callbacks, if necessary.

I still think 3) isn't a solution to 2). The main issue is that xact
callbacks don't address that storage has to be removed during redo as well.
For that, dropping storage has to be integrated with commit / abort records.

I don't think a custom WAL rmgr addresses this either - it has to be
integrated with the commit record, and the record has to be replayed as a
whole.

> I see huge potential in the table access method and would like to do my
> part in helping it in succeeding. I noted that the API is biased in how the
> existing heap implementation works and is also very focused on
> implementations of "local" storage engines. For more novel architectures
> (for example, various sorts of distributed architectures) and to be easier
> to work with, I think that the API can be improved in a few places. This is
> a first step in the direction of making the API both easier to use as well
> as enabling more novel use-cases.

I agree with that - there's lots more work to be done, and the evolution
from not having the abstraction clearly shows. Some of the deficiencies are
easy to fix, but others are there because there's no quick solution to them.

> > To me it still seems fundamentally the wrong direction to implement a "drop
> > relation callback" tableam callback.
>
> This is not really a "drop table" callback, it is just the most obvious
> case where this is missing. So, just to recap the situation as it looks
> right now.
>
> Here is (transactional) truncate table:
>
> 1. Allocate a new file node in the same tablespace as the table
> 2. Add the file node to the list of pending nodes to delete
> 3. Overwrite the existing file node in the relation with the new one
> 4. Call table_relation_set_new_filenode to tell the extension that there
> is a new filenode for the relation
>
> Here is drop table:
>
> 1. Add the existing file node to the list of pending deletes
> 2. Remove the table from the catalogs
>
> For an extension writer, the disappearance of the old file node is
> "invisible" since there is no callback about this, but it is very clear
> when a new file node is allocated. In addition to being inconsistent, it
> adds an extra burden on the extension writer. To notice that a file node
> has been unlinked you can register a transaction handler and investigate
> the pending list at commit or abort time. Even though possible, there are
> two problems with this: 1) the table access method is notified "late" in
> the transaction that the file node is going away, and 2) it is
> unnecessarily complicated to register a transaction handler only for
> inspecting this.

Using a transaction callback solely also doesn't address the redo issue...

I think to make this viable for filenodes that don't look like md.c's,
you'd have to add a way to make commit/abort records extensible by custom
rmgrs. Then the replay of a commit/abort could call the custom rmgr for the
"sub-record" during replay.

> Telling the access method that the filenode is unlinked by adding a
> callback is by far the best solution since it does not affect existing
> extensions and will give the table access methods opportunities to act on
> it immediately.

I'm loathe to add a callback that I don't think can be used correctly
without further changes.

Greetings,

Andres Freund
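The "sub-record" idea from the mail above could look roughly like this as a standalone mock (the dispatch table and record layout are invented for illustration): commit-record sub-records are tagged with an rmgr id, and replay hands each one back to the rmgr that emitted it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RMGRS 8

typedef void (*SubRecordRedo)(const void *data, size_t len);

/* Mock dispatch table: one redo function per custom rmgr. */
static SubRecordRedo rmgr_redo[MAX_RMGRS];

/* A commit record would carry a list of rmgr-tagged sub-records. */
typedef struct SubRecordStub
{
    int         rmgr_id;
    size_t      len;
    const void *data;
} SubRecordStub;

static void
register_rmgr(int id, SubRecordRedo fn)
{
    rmgr_redo[id] = fn;
}

/*
 * During replay of a commit record, each sub-record is dispatched to the
 * rmgr that emitted it, as part of replaying the record as a whole.
 */
static void
replay_commit_subrecords(const SubRecordStub *subs, int n)
{
    for (int i = 0; i < n; i++)
        if (rmgr_redo[subs[i].rmgr_id] != NULL)
            rmgr_redo[subs[i].rmgr_id](subs[i].data, subs[i].len);
}

/* Demo redo function for the assertions below. */
static int  demo_redo_calls = 0;

static void
demo_redo(const void *data, size_t len)
{
    (void) data;
    (void) len;
    demo_redo_calls++;
}
```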