Re: Table AM and DROP TABLE [ Was: Table AM and DDLs] - Mailing list pgsql-hackers
From | Mats Kindahl
---|---
Subject | Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]
Date |
Msg-id | CA+14425KObZ6kfoePtX1XRjV7r+hy8MAJTO7vdTvXkz4E2R6gg@mail.gmail.com
In response to | Re: Table AM and DROP TABLE [ Was: Table AM and DDLs] (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]
List | pgsql-hackers
On Wed, Feb 16, 2022 at 10:07 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 27/09/2021 14:59, Aleksander Alekseev wrote:
> Hi hackers,
>
>> As a matter of fact, I think the patch I suggested is the right approach:
>> let me elaborate on why.
>> [...]
>> It is straightforward to replace it by implementing the Table AM methods
>> above, but we are missing a callback on dropping the table. If we have that,
>> we can record the table-to-be-dropped in a similar manner to how the heap AM
>> does it and register a transaction callback using RegisterXactCallback.
>
> Since no one objected in 5 months, I assume Mats made a good point. At least,
> personally, I can't argue.
I agree that having a table AM callback at relation drop would make it
more consistent with creating and truncating a relation. Then again, the
indexam API doesn't have a drop-callback either.
That is actually a good point. We could add an on-drop-callback for the indexam as well, if we add it for tableam. Haven't looked at that though, so if you think it should be added, I can investigate.
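To make the shape of the change more concrete, here is a rough sketch of what I have in mind for the implementation side. Nothing here is final: the callback name relation_drop is just a placeholder for a new TableAmRoutine member, and the memam_* functions stand in for an in-memory table AM.

```c
/*
 * Sketch only.  Assume TableAmRoutine gains a new member
 *
 *     void (*relation_drop) (Relation rel);
 *
 * invoked when the relation is dropped.  An in-memory table AM could
 * then record the relation and register a transaction callback, in the
 * same spirit as the heap AM's pending-deletes handling.
 */
#include "postgres.h"

#include "access/xact.h"
#include "nodes/pg_list.h"
#include "utils/memutils.h"
#include "utils/rel.h"

static List *pending_drops = NIL;	/* relation OIDs scheduled for removal */
static bool xact_callback_registered = false;

static void
memam_xact_callback(XactEvent event, void *arg)
{
	if (event == XACT_EVENT_COMMIT)
	{
		ListCell   *lc;

		foreach(lc, pending_drops)
		{
			Oid			relid = lfirst_oid(lc);

			/* release the relation's shared memory segment here */
			(void) relid;
		}
		pending_drops = NIL;
	}
	else if (event == XACT_EVENT_ABORT)
		pending_drops = NIL;	/* forget the scheduled drops */
}

/* The hypothetical callback: it only *schedules* the drop. */
static void
memam_relation_drop(Relation rel)
{
	/* keep the list alive until end of transaction */
	MemoryContext oldcxt = MemoryContextSwitchTo(TopTransactionContext);

	pending_drops = lappend_oid(pending_drops, RelationGetRelid(rel));
	MemoryContextSwitchTo(oldcxt);

	if (!xact_callback_registered)
	{
		RegisterXactCallback(memam_xact_callback, NULL);
		xact_callback_registered = true;
	}
}
```

Note that nothing in the sketch generates WAL: the actual work happens in the commit callback, so the WAL replay concern you raise below still applies as-is.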
But what can you actually do in the callback? WAL replay of dropping the
storage needs to work without running any AM-specific code. It happens
as part of replaying a commit record. So whatever action you do in the
callback will not be executed at WAL replay.
Also, because the callback
merely *schedules* things to happen at commit, it cannot generate
separate WAL records about dropping resources either.
Digressing slightly: This is actually a drawback and I have been looking for a way to do things on recovery.
Just to have an example: for an in-memory table that is distributed, it might be of interest to fetch the contents from another node on recovery, even though the table is empty locally. This is currently not possible, since it is assumed that all actions are present in the WAL, and an in-memory table would not use the WAL for this: one of its goals is to be fast and avoid writes to disk.
It might be possible to piggyback this fetching on the first select or insert done on the table, but that makes the system more complicated, since it is easy to miss one of the places where the fetching has to happen. If you always do it on recovery, there is a single place to add it, and the system is in a predictable state afterwards.
In addition, doing the work on first access of the table would add execution time to that first operation, while most users would assume that all such work is done at recovery and that the database is "warm" after recovery.
However, this is slightly outside the discussion for this proposed change, so we can ignore it for now.
Mats's in-memory table is an interesting example. I guess you don't even
try WAL-logging that, so it's OK that nothing happens at WAL replay. As
you said, you can use the callback to schedule deletion of the shared memory block
and an end-of-xact callback to perform the deletion. You're
basically re-inventing a pending-deletes mechanism similar to smgr's.
I think you could actually piggyback on smgr's pending-deletions
mechanism instead of re-inventing it. In the callback, you can call
smgrGetPendingDeletes(), and drop the shared memory segment for any
relation in that list.
Hmm... it is a good point that smgrGetPendingDeletes() can be used in the commit callback for something as simple as a memory table when all the data is local. It should also work well with a distributed optimistic storage engine when you certify the transaction at commit time. What will happen then is that the actual "drop command" will be sent out at commit time rather than when the command is actually executed.
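For the simple local case, the commit callback from my sketch above could then shrink to something like this. This is untested, and I have not checked exactly where in commit processing the pending-deletes list is still available, so take it as an illustration only:

```c
#include "postgres.h"

#include "access/xact.h"
#include "catalog/storage.h"
#include "storage/relfilenode.h"

static void
memam_xact_callback(XactEvent event, void *arg)
{
	if (event == XACT_EVENT_COMMIT)
	{
		RelFileNode *nodes;
		int			nrels;

		/* forCommit = true: storage that will really go away */
		nrels = smgrGetPendingDeletes(true, &nodes);
		for (int i = 0; i < nrels; i++)
		{
			/* release the in-memory segment keyed by nodes[i], if any */
		}
		if (nrels > 0)
			pfree(nodes);
	}
}
```

This assumes the AM still creates and drops smgr storage for the relation, so that it actually shows up in the pending-deletes list.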
My main interest, however, is to have an API that works for all kinds of storage engines: not just local storage, but also distributed storage systems, including ones that have to interact with existing implementations. There are a few reasons why getting a notification when the table is dropped, rather than when the transaction commits, is beneficial.
- In a distributed storage engine you might want to distribute changes speculatively when they happen so that the commit, once it occurs, will be fast. By sending out the action early, you allow work to start independently of the current machine, which will improve parallelization.
- In a distributed storage engine, the order in which statements are received remotely makes a difference. For example, if you want to use a distributed locking scheme for your distributed storage, you are currently forced to implement an optimistic scheme, while in reality you might want to distribute the drop and lock the table exclusively on all remote nodes (this is already what PostgreSQL does locally, locking the table on a drop). I do realize that distributed transactions are not that simple and that there are other problems associated with this, but it would still be an unnecessary restriction on what you can do.
- A problem with optimistic protocols in general is that their performance drops when you have a lot of writes: other, smaller transactions will constantly force a long-running transaction to be aborted. This also means there is a risk that a transaction dropping a table has to be aborted simply because other transactions are updating the table. In a distributed system there will be more of those, so the odds of aborting the "more important" transaction (in the sense of needing stronger locks) are higher.
- A smaller issue is that right now the storage manager (smgr) and the transaction system are quite tightly coupled, which makes it more difficult to make the storage system "pluggable". I think that not requiring the use of pendingDeletes would move one step in the direction of removing this coupling, but I am not entirely sure here.
It is likely that many of these problems can be worked around by placing restrictions on how DDLs can be used in transactions, but that would create unnecessary restrictions for the end user. It might also be possible to work around them by placing code at strategic points in the implementation, but that approach is error-prone.
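To illustrate what I mean by acting when the statement executes rather than at commit, a drop callback for a distributed engine might look roughly like this. The functions dist_lock_remote() and dist_broadcast_drop() are made-up names standing in for whatever transport the engine uses; this is a sketch of the idea, not a proposal for an API:

```c
#include "postgres.h"

#include "utils/rel.h"

/* made-up transport functions, not part of any existing API */
extern void dist_lock_remote(Oid relid);
extern void dist_broadcast_drop(Oid relid);

static void
distam_relation_drop(Relation rel)
{
	Oid			relid = RelationGetRelid(rel);

	/*
	 * Take the exclusive lock on all remote nodes right away, mirroring
	 * the lock PostgreSQL already takes locally on DROP TABLE.
	 */
	dist_lock_remote(relid);

	/*
	 * Ship the drop speculatively so the remote nodes can start working;
	 * a commit or abort callback later confirms or rolls it back.
	 */
	dist_broadcast_drop(relid);
}
```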
Best wishes,
Mats Kindahl
- Heikki