Re: Table AM and DDLs - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Table AM and DDLs
Date
Msg-id 20210321231626.mt32oc653b7vmdn2@alap3.anarazel.de
Whole thread Raw
Responses Table AM and DROP TABLE [ Was: Table AM and DDLs]
List pgsql-hackers
Hi,

On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres@anarazel.de> wrote:
> Thanks for the answer and sorry about the late reply.

Mine is even later ;)


> > I don't think that's quite right. It's not exactly obvious from the
> > name, but RelationDropStorage() does not actually drop storage. Instead
> > it *schedules* the storage to be dropped upon commit.
> >
> > The reason for deferring the dropping of table storage is that DDL in
> > postgres is transactional. Therefore we cannot remove the storage at the
> > moment the DROP TABLE is executed - only when the transaction that
> > performed the DDL commits. Therefore just providing you with a callback
> > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > you'd not have a way to execute the "actual" dropping of the relation at
> > the later stage.
> >
> 
> Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> -> heap_drop_with_catalog) where the delete was just scheduled for deletion
> but it appeared like this was the place to actually perform the "actual"
> delete. Looking closer, I see this was the wrong location. However, the
> intention was to get a callback when the "actual" delete should happen.
> Before that, the blocks are still potentially alive and could be read, so
> shouldn't be recycled.
> 
> It seems the right location seems to be in the storage manager (smgr_unlink
> in smgr.c), but that does not seem to be extensible, or are there any plans
> to make it available so that you can implement something other than just
> "magnetic disk"?

There've been patches to add new types of storage below smgr.c, but not
in way that can be done outside of build time. As far as I recall.


> > Before I explain some more: Could you describe in a bit more detail what
> > kind of optimization you'd like to make?
> >
> 
> This is not really about any optimizations, it more about a good API for
> tables and managing storage. If a memory table can be implemented entirely
> in the extension and storage managed fully, there is a lot of interesting
> potential for various implementations of table backends. For this to work I
> think it is necessary to be able to handle schema changes for the backend
> storage in addition to scans, inserts, updates, and deletes, but I am not
> sure if it is already possible in some way that I haven't discovered or if
> I should just try to propose something (making the storage manager API
> extensible seems like a good first attempt).

As long as you have a compatible definition of what is acceptable "in
place" ALTER TABLE (e.g. adding new columns, changing between compatible
types), and what requires a table rewrite (e.g. an incompatible column
type change), I don't see a real problem. Except for the unlink thing
above.

Any schema change requiring a table rewrite will trigger a new relation
to be created, which in turn will involve tableam. After that you'll
just get called back to re-insert all the tuples in the original
relation.

If you want a different definition on what needs a rewrite, good luck,
it'll be a heck of a lot more work.



> > Due to postgres' transactional DDL you cannot really change the storage
> > layout of *existing data* when that DDL command is executed - the data
> > still needs to be interpretable in case the DDL is rolled back
> > (including when crashing).
> >
> 
> No, didn't expect this, but some means to see that a schema change is about
> to happen.

For anything that's not in-place you'll see a new table being created
(c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
identifying the data (as opposed to the oid, identifying a relation),
will then be swapped with the current table's relfilenode via
finish_heap_swap().


> > Other changes, e.g. changing the type of a column "sufficiently", will
> > cause a so called table rewrite. Which means that a new relation will be
> > created (including a call to relation_set_new_filenode()), then that new
> > relation will get all the new data inserted, and then
> > pg_class->relfilenode for the "original" relation will be changed to the
> > "rewritten" table (there's two variants of this, once for rewrites due
> > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).

> But that is not visible in the access method interface. If I add debug
> output to the memory table, I only see a call to needs_toast_table. If
> there were a new call to create a new block and some additional information
> about , this would be possible to handle.

It should be. If I e.g. do

CREATE TABLE blarg(id int4 not null);
I get one call to table_relation_set_new_filenode()

#0  table_relation_set_new_filenode (rel=0x7f84c2417b70, newrnode=0x7f84c2417b70, persistence=112 'p',
freezeXid=0x7ffc8c61263c,minmulti=0x7ffc8c612638)
 
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0,
relid=3016410,relfilenode=3016410, accessmtd=2, 
 
    tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p', shared_relation=false, mapped_relation=false,
allow_system_table_mods=false,
 
    relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900 "blarg", relnamespace=2200, reltablespace=0,
relid=3016410,reltypeid=0, 
 
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8, cooked_constraints=0x0, relkind=114 'r',
relpersistence=112'p', shared_relation=false, 
 
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false,
is_internal=false,relrewrite=0, 
 
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
#3  0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114 'r', ownerId=10, typaddress=0x0, 
    queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null

then when I do
ALTER TABLE blarg ALTER COLUMN id TYPE int8;
I see
#0  table_relation_set_new_filenode (rel=0x7f84c241f2a8, newrnode=0x7f84c241f2a8, persistence=112 'p',
freezeXid=0x7ffc8c61275c,minmulti=0x7ffc8c612758)
 
    at /home/andres/src/postgresql/src/include/access/tableam.h:1596
#1  0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200, reltablespace=0,
relid=3016407,relfilenode=3016407, 
 
    accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r', relpersistence=112 'p', shared_relation=false,
mapped_relation=false,allow_system_table_mods=true, 
 
    relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at
/home/andres/src/postgresql/src/backend/catalog/heap.c:436
#2  0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860 "pg_temp_3016404", relnamespace=2200,
reltablespace=0,relid=3016407, reltypeid=0, 
 
    reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0, cooked_constraints=0x0, relkind=114 'r',
relpersistence=112'p', shared_relation=false, 
 
    mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=false, allow_system_table_mods=true,
is_internal=true,relrewrite=3016404, 
 
    typaddress=0x0) at /home/andres/src/postgresql/src/backend/catalog/heap.c:1291



> I *was* expecting either a call of set_filenode with a new xact id, or
> something like that, and with some information so that you can locate the
> schema change planned (e.g., digging through pg_class and friends), I just
> don't see that when I add debug output.

You should. And it'll have the new table "schema" associated. E.g. in
the above example the new table will have
rel->rd_att->natts == 1
rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [HACKERS] Custom compression methods (mac+lz4.h)
Next
From: Justin Pryzby
Date:
Subject: Re: [HACKERS] Custom compression methods (mac+lz4.h)