Table AM and DROP TABLE [ Was: Table AM and DDLs] - Mailing list pgsql-hackers

From Mats Kindahl
Subject Table AM and DROP TABLE [ Was: Table AM and DDLs]
Date
Msg-id CA+144243EHgLYcmBSd+WWDDOfZRzAux8F=L6X7TxFZRADR9mUQ@mail.gmail.com
Whole thread Raw
In response to Re: Table AM and DDLs  (Andres Freund <andres@anarazel.de>)
Responses Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]  (Aleksander Alekseev <aleksander@timescale.com>)
Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]  (Alexander Kuzmenkov <akuzmenkov@timescale.com>)
Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, Mar 22, 2021 at 12:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the answer and sorry about the late reply.

Mine is even later ;)

:)

Seems I keep the tradition. :)

Thanks a lot for the pointers, I have some comments below both about DROP TABLE and ALTER TABLE.


> > I don't think that's quite right. It's not exactly obvious from the
> > name, but RelationDropStorage() does not actually drop storage. Instead
> > it *schedules* the storage to be dropped upon commit.
> >
> > The reason for deferring the dropping of table storage is that DDL in
> > postgres is transactional. Therefore we cannot remove the storage at the
> > moment the DROP TABLE is executed - only when the transaction that
> > performed the DDL commits. Therefore just providing you with a callback
> > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > you'd not have a way to execute the "actual" dropping of the relation at
> > the later stage.
> >
>
> Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> -> heap_drop_with_catalog) where the delete was just scheduled for deletion
> but it appeared like this was the place to actually perform the "actual"
> delete. Looking closer, I see this was the wrong location. However, the
> intention was to get a callback when the "actual" delete should happen.
> Before that, the blocks are still potentially alive and could be read, so
> shouldn't be recycled.
>
> It seems the right location seems to be in the storage manager (smgr_unlink
> in smgr.c), but that does not seem to be extensible, or are there any plans
> to make it available so that you can implement something other than just
> "magnetic disk"?

There've been patches to add new types of storage below smgr.c, but not
in way that can be done outside of build time. As far as I recall.

I have done some more research and I do not think it is necessary to extend the storage layer. As a matter of fact, I think the patch I suggested is the right approach: let me elaborate on why.

Let's look at how the implementation works with the heap access method (the file heapam_handler.c) and for this case let's use CREATE TABLE, DROP TABLE, and TRUNCATE TABLE (last one since that is supported in the Table AM and hence is a good reference for the comparison).

Disregarding surrounding layers, we have three layers that are important here:
  1. Heap catalog layer (not sure what to call it, but it's the src/backend/catalog/heap.c file)
  2. AM layer (the src/backend/access/heap/heapam_handler.c file)
  3. Storage layer (the src/backend/catalog/storage.c file) "code to create and destroy physical storage for relations".
Looking at CREATE TRUNCATE, we have the following calls through these layers:
  1. In the heap catalog layer we have a call of heap_truncate_one_rel which calls the table AM layer.
  2. In the Table AM layer heapam_relation_nontransactional_truncate will just call the storage layer to truncate the storage.
  3. The storage layer gets called through RelationTruncate, which will truncate the actual files.
Looking at CREATE TABLE, we have a similar pattern:
  1. In the heap catalog layer heap_create_with_catalog is called, which in turn calls heap_create, which will create the actual relcache and also call the table AM layer if it is a relation, materialized view, or toastvalue.
  2. In the Table AM layer, heapam_relation_set_new_filenode is called which will record the transaction identifiers and call the storage layer to create the underlying storage.
  3. In the storage layer, RelationCreateStorage will create the necessary storage, but also register the table for deletion if the transaction is aborted.
Note here that the storage layer remembers the table for deletion by saving it in pendingDeletes, which is local to the storage layer.

Looking at DROP TABLE, we have a similar pattern, but am missing one step:
  1. In the heap catalog layer the function heap_drop_with_catalog is called, which releases the system cache and calls the storage layer to drop the relation
  2. In the storage layer, the function RelationDropStorage is called, which will record the table to be dropped in the pendingDeletes
When committing (or aborting) the transaction, there are two calls that are interesting, in this order:
  1. CallXactCallbacks which calls registered callbacks
  2. smgrDoPendingDeletes, which calls the storage layer directly to perform the actual deletion, if necessary.
Now, suppose that we want to replace the storage layer with a different one. It is straightforward to replace it by implementing the Table AM methods above, but we are missing a callback on dropping the table. If we have that, we can record the table-to-be-dropped in a similar manner to how the heap AM does it and register a transaction callback using RegisterXactCallback.



> > Before I explain some more: Could you describe in a bit more detail what
> > kind of optimization you'd like to make?
> >
>
> This is not really about any optimizations, it more about a good API for
> tables and managing storage. If a memory table can be implemented entirely
> in the extension and storage managed fully, there is a lot of interesting
> potential for various implementations of table backends. For this to work I
> think it is necessary to be able to handle schema changes for the backend
> storage in addition to scans, inserts, updates, and deletes, but I am not
> sure if it is already possible in some way that I haven't discovered or if
> I should just try to propose something (making the storage manager API
> extensible seems like a good first attempt).

As long as you have a compatible definition of what is acceptable "in
place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.

Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.

If you want a different definition on what needs a rewrite, good luck,
it'll be a heck of a lot more work.

No, this should work fine.

> > Due to postgres' transactional DDL you cannot really change the storage
> > layout of *existing data* when that DDL command is executed - the data
> > still needs to be interpretable in case the DDL is rolled back
> > (including when crashing).
> >
>
> No, didn't expect this, but some means to see that a schema change is about
> to happen.

For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().


> > Other changes, e.g. changing the type of a column "sufficiently", will
> > cause a so called table rewrite. Which means that a new relation will be
> > created (including a call to relation_set_new_filenode()), then that new
> > relation will get all the new data inserted, and then
> > pg_class->relfilenode for the "original" relation will be changed to the
> > "rewritten" table (there's two variants of this, once for rewrites due
> > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).

> But that is not visible in the access method interface. If I add debug
> output to the memory table, I only see a call to needs_toast_table. If
> there were a new call to create a new block and some additional information
> about , this would be possible to handle.

It should be. If I e.g. do

CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()

#0  table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c, minmulti=0x7ffc8c612638)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410, accessmtd=2,
    tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=false,
    relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3  0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0,
    queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null

then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0  table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c, minmulti=0x7ffc8c612758)
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, relfilenode=3016407,
    accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false, allow_system_table_mods=true,
    relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at /home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407, reltypeid=0,
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true, is_internal=true, relrewrite=3016404,
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291



> I *was* expecting either a call of set_filenode with a new xact id, or
> something like that, and with some information so that you can locate the
> schema change planned (e.g., digging through pg_class and friends), I just
> don't see that when I add debug output.

You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)

I didn't get a callback because I did ADD COLUMN and that works differently: it does not call set_filenode until you either try to insert something or run a vacuum. Thanks for the pointers, it helped a lot. I need to look over the code a little more.

Thanks,
Mats Kindahl



Greetings,

Andres Freund

pgsql-hackers by date:

Previous
From: David Steele
Date:
Subject: Re: Stronger safeguard for archive recovery not to miss data
Next
From: Bryn Llewellyn
Date:
Subject: Re: Have I found an interval arithmetic bug?