Re: [PATCH] Partial foreign key updates in referential integrity triggers - Mailing list pgsql-hackers
From | Paul Martinez |
---|---|
Subject | Re: [PATCH] Partial foreign key updates in referential integrity triggers |
Date | |
Msg-id | CAF+2_SH8Lg7qz-Rz9HNU9aHvjiu8Db4ftYdJEhpdwC1LvgDsrQ@mail.gmail.com Whole thread Raw |
In response to | Re: [PATCH] Partial foreign key updates in referential integrity triggers (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses |
Re: [PATCH] Partial foreign key updates in referential integrity triggers
|
List | pgsql-hackers |
On Wed, Nov 24, 2021 at 10:59 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > I was looking through this example to see if it could be adapted for the > documentation. > > The way the users table is defined, it appears that "id" is actually > unique and the primary key ought to be just (id). The DELETE command > you show also just uses the id column to find the user, which would be > bad if the user id is not unique across tenants. If the id were unique, > then the foreign key from posts to users would just use the user id > column and the whole problem of the ON DELETE SET NULL action would go > away. If the primary key of users is indeed supposed to be (tenant_id, > id), then maybe the definition of the users.id column should not use > serial, and the DELETE command should also look at the tenant_id column. > (The same question applies to posts.id.) > > Also, you initially wrote that this is a denormalized schema. I think > if we keep the keys the way you show, then this isn't denormalized. But > if we considered users.id globally unique, then there would be > normalization concerns. > > What do you think? Regarding that specific example, in a production scenario, yes, the DELETE command should reference both columns. And if used for documentation both columns should be referenced for clarity/correctness. I don't think the exact semantics regarding the uniqueness of the id column are critical. Configuring separate auto-incrementing ids per tenant would be fairly complex; practically speaking, a single database with multi-tenant data will use serial to get auto-incrementing ids (or else use UUIDs to prevent conflicts). The possibility of conflicting ids likely won't arise until moving to a distributed environment, at which point queries should only be routed towards a single shard (where uniqueness will still hold), either by some higher level application-level context, or by including the tenant_id as part of the query. I think there are three separate motivating use cases for using (tenant_id, id) as primary keys everywhere in a multi-tenant database: 1) I initially encountered this problem while migrating a database to use Citus, which requires that primary keys (and any other uniqueness constraints) include the shard key, which forces the primary key to be (tenant_id, id). I'm not sure what constraints other sharding solutions enforce, but I don't feel like this feature is over-fitting to Citus' specific implementation -- it seems like a pretty reasonable/generalizable solution when sharding data: prefix all your indexes with the shard key. 2) As I mentioned in my response to Tom in my original proposal thread, and as Matthias alluded to, using composite primary keys grants significantly stronger referential integrity by preventing cross-tenant references. I think this represents a significant leap in the robustness and security of a schema, to the point where you could consider it a design flaw to _not_ use composite keys. https://www.postgresql.org/message-id/flat/CAF%2B2_SFFCjWMpxo0cj3yaqMavcb3Byd0bSG%2B0UPs7RVb8EF99g%40mail.gmail.com#c0e2b37b223bfbf8ece561f02865286c 3) For performance reasons, indexes on foreign keys will often be prefixed by the tenant_id to speed up index scans. (I think algorithmically doing an index lookup on (fk_id) vs. (tenant_id, fk_id) has the same complexity, but repeated index scans, such as when doing a join, should in practice be more efficient when including a tenant_id, because most queries will only reference a single tenant so the looked up values are more likely to be on the same pages.) If a foreign key only references the id column, then ON DELETE CASCADE triggers will only use the id column in their DELETE query. Thus, to ensure that deletes are still fast, you will need to create an index on (fk_id) in addition to the (tenant_id, fk_id) index, which would cause _significant_ database bloat. (In practice, the presence of both indexes will also confuse the query planner and now BOTH indexes will take up precious space in the database's working memory, so it really creates all sorts of problems.) Using a composite foreign key will ensure that ON DELETE CASCADE trigger query will use both columns. - Paul
pgsql-hackers by date: