Thread: on update / on delete performance of foreign keys
Hi I ran into some performance problems regarding foreign keys lately. My schema has about 20 tables, which each contain from 10 to 100.000 records. They have quite complicated interdependencies, modeled using foregin keys set to "on update cascade, on delete cascade". The schema stores data for multiple customers - Recently I wanted to extract the data for just a single customer. I duplicated the schema, and deleted all but one customer from the "customer" table. This worked as expected, but the delete took a few hours (!) on a moderatly fast machine (dual 1GHz PIII, RAID5-Array for postgres-data). As far as I understand the code, foreign keys are implemented using triggers, which in case of "on delete cascade" delete from the "slave"-table when a record from the "master"-table is deleted. I guess that the bad performance is due to running this trigger about 100.1000 times when deleting a lot of rows from a large table - and each time it has to find referencing tuples by doing an index scan - so the performance is about as bad as it would be if "delete * from table" would use an index scan. Since postgres already incoporates code to check foreign keys more efficiently (when doing alter table ... add constraint .. foreign key, postgres seems to use a merge or a hash join, instead of a nested loop), I wondered how hard it would be to use this for the triggers too. I imagined creating a statement-level trigger in parallel to the row-level triggers, and defining some threshold (let's say, more than 10% of the rows deleted). If the threshold is reached, the row-level trigger would just do nothing, and the statement-level trigger would delete the referencing records doing a join. Would this be feasable? And would it be something a newbie could tackle, or is it more involved than I think? greetings, Florian Pflug
Attachment
Florian G. Pflug wrote: > Hi > > I ran into some performance problems regarding foreign keys lately. > My schema has about 20 tables, which each contain from 10 to 100.000 > records. They have quite complicated interdependencies, modeled using > foregin keys set to "on update cascade, on delete cascade". > The schema stores data for multiple customers - Recently I wanted > to extract the data for just a single customer. I duplicated the schema, > and deleted all but one customer from the "customer" table. This worked > as expected, but the delete took a few hours (!) on a moderatly fast > machine (dual 1GHz PIII, RAID5-Array for postgres-data). PostgreSQL doesn't automatically add indexes to foreign-key columns. That sounds like the issue to me. -- Richard Huxton Archonet Ltd
"Florian G. Pflug" <fgp@phlo.org> writes: > when deleting a lot of rows from a large table - and each time it has to find > referencing tuples by doing an index scan Are you sure it was even an index scan? And not doing a sequential table scan for every deletion? In order to do an index scan you need an index on foreign key columns in the referencing tables, not just the referenced tables. Usually people only think about the referenced tables because usually that's all they need. Only when you start cascading updates and deletes do you need indexes on the columns in the referencing tables. -- greg
Richard Huxton wrote: > Florian G. Pflug wrote: >> I ran into some performance problems regarding foreign keys lately. >> My schema has about 20 tables, which each contain from 10 to 100.000 >> records. They have quite complicated interdependencies, modeled using >> foregin keys set to "on update cascade, on delete cascade". >> The schema stores data for multiple customers - Recently I wanted >> to extract the data for just a single customer. I duplicated the schema, >> and deleted all but one customer from the "customer" table. This worked >> as expected, but the delete took a few hours (!) on a moderatly fast >> machine (dual 1GHz PIII, RAID5-Array for postgres-data). > PostgreSQL doesn't automatically add indexes to foreign-key columns. > That sounds like the issue to me. Oh... *feeling a bit stupid*... Seems that I got confused, because it requires an index to exist on the referenced table (To speed up inserts, updates), but not on the referencing table... Still, I believe that even with an index, the performance will suffer when deleting a lot of rows from an referenced tabled, because for each row a trigger has to fire, and do an index scan. It's entirely possible though, that this is already optimized, and I just misread the code ;-) greetings, Florian Pflug
Attachment
On Mon, 24 Jan 2005, Florian G. Pflug wrote: > Since postgres already incoporates code to check foreign keys more > efficiently (when doing alter table ... add constraint .. foreign key, > postgres seems to use a merge or a hash join, instead of a nested loop), > I wondered how hard it would be to use this for the triggers too. > > I imagined creating a statement-level trigger in parallel to the > row-level triggers, and defining some threshold (let's say, more than > 10% of the rows deleted). If the threshold is reached, the row-level > trigger would just do nothing, and the statement-level trigger would > delete the referencing records doing a join. > > Would this be feasable? And would it be something a newbie could tackle, > or is it more involved than I think? It's a little more involved. The first is that I think there's no good way to tell the row trigger to do nothing (remember that the constraints may be deferred so simple flags aren't sufficient). The second is that these triggers will want to know which rows are deleted, but AFAIK statement-level triggers don't currently give you that information and deleting/changing any rows that aren't satisfied does not give the correct behavior. The no action case is actually a little more involved again as it needs to remove rows from the set of changed pk rows if new pk rows have come into existance for matching keys.
Stephan Szabo wrote: > On Mon, 24 Jan 2005, Florian G. Pflug wrote: > >>Since postgres already incoporates code to check foreign keys more >>efficiently (when doing alter table ... add constraint .. foreign key, >>postgres seems to use a merge or a hash join, instead of a nested loop), >>I wondered how hard it would be to use this for the triggers too. >> >>I imagined creating a statement-level trigger in parallel to the >>row-level triggers, and defining some threshold (let's say, more than >>10% of the rows deleted). If the threshold is reached, the row-level >>trigger would just do nothing, and the statement-level trigger would >>delete the referencing records doing a join. >> >>Would this be feasable? And would it be something a newbie could tackle, >>or is it more involved than I think? > > It's a little more involved. The first is that I think there's no good > way to tell the row trigger to do nothing (remember that the constraints > may be deferred so simple flags aren't sufficient). I would be content if my optimization works for the not-deferred case - I'd don't fully understand how deferred foreign keys are handled in postgres. (I guess I don't even fully understand their semantics - I use them only when doing bulk inserts, and there are either circular dependencies, or I don't feel like find the right table order ;-)) > The second is that > these triggers will want to know which rows are deleted, but AFAIK > statement-level triggers don't currently give you that information and > deleting/changing any rows that aren't satisfied does not give the correct > behavior. This I do not understand. Isn't it sufficient to delete any rows whose reference does not exist (for the on-delete-cascade case), or complain if such rows exist (for the no-action/restrict case)? The on-update-cascade case is difficult I guess - I'm not sure if my idea even works for that case, now that I think about it... > The no action case is actually a little more involved again as > it needs to remove rows from the set of changed pk rows if new pk rows > have come into existance for matching keys. Guess without understanding your previous comment I'm lost here too - I wouldn't care to check only changed rows - I would check them all - but only if some estimate shows that it will probably cheaper. At the moment I'm writing a few plpgsql functions that do what I want. They disable all constraint-related trigger, do a deleted, and then recursivly traverse all tables (following the foreign-keys), and do a "delete from .. where not exists (select 1 from ... where ...)". I'll if I stumble upon problems - maybe I'll suddenly understand your comments ;-))) greetings, Florian Pflug
Attachment
On Mon, 24 Jan 2005, Florian G. Pflug wrote: > Stephan Szabo wrote: > > On Mon, 24 Jan 2005, Florian G. Pflug wrote: > > > >>Since postgres already incoporates code to check foreign keys more > >>efficiently (when doing alter table ... add constraint .. foreign key, > >>postgres seems to use a merge or a hash join, instead of a nested loop), > >>I wondered how hard it would be to use this for the triggers too. > >> > >>I imagined creating a statement-level trigger in parallel to the > >>row-level triggers, and defining some threshold (let's say, more than > >>10% of the rows deleted). If the threshold is reached, the row-level > >>trigger would just do nothing, and the statement-level trigger would > >>delete the referencing records doing a join. > >> > >>Would this be feasable? And would it be something a newbie could tackle, > >>or is it more involved than I think? > > > > It's a little more involved. The first is that I think there's no good > > way to tell the row trigger to do nothing (remember that the constraints > > may be deferred so simple flags aren't sufficient). > I would be content if my optimization works for the not-deferred case - > I'd don't fully understand how deferred foreign keys are handled in > postgres. (I guess I don't even fully understand their semantics - I > use them only when doing bulk inserts, and there are either circular > dependencies, or I don't feel like find the right table order ;-)) Actually, thinking about it, this might not be so bad, you'd need to effectively have a stack of states (for events triggered by triggered actions perhaps many levels deep) but I think that's already effectively there. > > > The second is that > > these triggers will want to know which rows are deleted, but AFAIK > > statement-level triggers don't currently give you that information and > > deleting/changing any rows that aren't satisfied does not give the correct > > behavior. > This I do not understand. Isn't it sufficient to delete any rows whose > reference does not exist (for the on-delete-cascade case), or complain > if such rows exist (for the no-action/restrict case)? The > on-update-cascade case is difficult I guess - I'm not sure if my idea > even works for that case, now that I think about it... It's not sufficient to do the delete for non existant pk rows in the deferred case. I also think we'd need to decide on the behavior for the PostgreSQL case where a user trigger runs in between the delete and the action (for example, if I delete where pk=1 and then in between the delete and its action insert a row with pk=1 does the delete fire? The spec doesn't say much because I don't think you can run anything between the two.) insert into pk values (1); begin; insert into fk values (2); delete from pk; commit; AFAICT to follow the foreign key semantics if the foreign key check is deferred an error occurs on commit. Deleting the fk row on the delete from pk is not allowed. I think it may be valid for on delete no action even in the deferred case(*) , but I haven't done alot of thinking about it, but I think it's also invalid for deferred restrict since only the rows being deleted have the restrict applied to them, so an insert into pk values (2) between the delete and commit would allow the transaction to succeed AFAIK. (*) - I'm not sure how you'd necessarily give a complete error message if the error should really be that an insert was invalid but you noticed it on a delete check. I haven't thought about the update cases at all. > > The no action case is actually a little more involved again as > > it needs to remove rows from the set of changed pk rows if new pk rows > > have come into existance for matching keys. > Guess without understanding your previous comment I'm lost here too - I > wouldn't care to check only changed rows - I would check them all - but > only if some estimate shows that it will probably cheaper. > At the moment I'm writing a few plpgsql functions that do what I want. > They disable all constraint-related trigger, do a deleted, and then > recursivly traverse all tables (following the foreign-keys), and do > a "delete from .. where not exists (select 1 from ... where ...)". > I'll if I stumble upon problems - maybe I'll suddenly understand your > comments ;-))) Note that this may fail for cases where the original delete or any of the triggered deletes run statements that should cause triggered actions.
In article <41F5088D.8060702@phlo.org>, "Florian G. Pflug" <fgp@phlo.org> writes: >> PostgreSQL doesn't automatically add indexes to foreign-key >> columns. That sounds like the issue to me. > Oh... *feeling a bit stupid*... Seems that I got confused, because it > requires an index to exist on the referenced table (To speed up inserts, > updates), but not on the referencing table... That's not quite right. What PostgreSQL (or any other DBMS) requires from the referenced table is a UNIQUE constraint on the column in question so that the referencing table points to a single row, and UNIQUE constraints are usually implemented by indices.
Harald Fuchs <hf0722x@protecting.net> writes: > "Florian G. Pflug" <fgp@phlo.org> writes: >>> PostgreSQL doesn't automatically add indexes to foreign-key >>> columns. That sounds like the issue to me. >> Oh... *feeling a bit stupid*... Seems that I got confused, because it >> requires an index to exist on the referenced table (To speed up inserts, >> updates), but not on the referencing table... > That's not quite right. What PostgreSQL (or any other DBMS) requires > from the referenced table is a UNIQUE constraint on the column in > question so that the referencing table points to a single row, and > UNIQUE constraints are usually implemented by indices. You missed the point. PG does force you to put an index on the referenced column, but it does not force you to put one on the referencing column. However, deletions in the referenced table are going to be really slow if there is no index on the referencing column, because there's no fast way to look up referencing rows matching the key to be deleted. regards, tom lane
Stephan Szabo wrote: >>>The second is that >>>these triggers will want to know which rows are deleted, but AFAIK >>>statement-level triggers don't currently give you that information and >>>deleting/changing any rows that aren't satisfied does not give the correct >>>behavior. >> >>This I do not understand. Isn't it sufficient to delete any rows whose >>reference does not exist (for the on-delete-cascade case), or complain >>if such rows exist (for the no-action/restrict case)? The >>on-update-cascade case is difficult I guess - I'm not sure if my idea >>even works for that case, now that I think about it... > > It's not sufficient to do the delete for non existant pk rows in the > deferred case. I also think we'd need to decide on the behavior for the > PostgreSQL case where a user trigger runs in between the delete and the > action (for example, if I delete where pk=1 and then in between the delete > and its action insert a row with pk=1 does the delete fire? The spec > doesn't say much because I don't think you can run anything between the > two.) > > insert into pk values (1); > begin; > insert into fk values (2); > delete from pk; > commit; > > AFAICT to follow the foreign key semantics if the foreign key check is > deferred an error occurs on commit. Deleting the fk row on the delete > from pk is not allowed. So, does that mean that on-delete-cascade effectivly doesn't cascade if deferred? Or only to rows created _before_ the delete? (Altough this is not what your example shows) > I think it may be valid for on delete no action even in the deferred > case(*) , but I haven't done alot of thinking about it, but I think it's > also invalid for deferred restrict since only the rows being deleted have > the restrict applied to them, so an insert into pk values (2) between the > delete and commit would allow the transaction to succeed AFAIK. > > (*) - I'm not sure how you'd necessarily give a complete error message if > the error should really be that an insert was invalid but you noticed it > on a delete check. > > I haven't thought about the update cases at all. Hm... but, if postgres is required to show the exactly same behaviour, no matter if constraints are deferred or not, what then is the point of deferring constraints at all (apart from getting the error at a later point)? Or did I completly missunderstand you? Is the SQL-Standard online somewhere? I'd like to read what it has to say on deferred triggers - I'm still confused about their semantics (I couldn't even say what semantics they should have ;-) ). greetings, Florian Pflug PS: And thanks for your patience while explaining this stuff to me.
Attachment
On Tue, 25 Jan 2005, Florian G. Pflug wrote: > Stephan Szabo wrote: > > > It's not sufficient to do the delete for non existant pk rows in the > > deferred case. I also think we'd need to decide on the behavior for the > > PostgreSQL case where a user trigger runs in between the delete and the > > action (for example, if I delete where pk=1 and then in between the delete > > and its action insert a row with pk=1 does the delete fire? The spec > > doesn't say much because I don't think you can run anything between the > > two.) > > > > insert into pk values (1); > > begin; > > insert into fk values (2); > > delete from pk; > > commit; > > > > AFAICT to follow the foreign key semantics if the foreign key check is > > deferred an error occurs on commit. Deleting the fk row on the delete > > from pk is not allowed. > So, does that mean that on-delete-cascade effectivly doesn't cascade if > deferred? Or only to rows created _before_ the delete? (Altough this is > not what your example shows) No, just that a delete cascade only deletes rows that are dependent upon rows marked for deletion in the referenced table. If there were an fk row with 1 in the above as well, that row would be removed. > > I think it may be valid for on delete no action even in the deferred > > case(*) , but I haven't done alot of thinking about it, but I think it's > > also invalid for deferred restrict since only the rows being deleted have > > the restrict applied to them, so an insert into pk values (2) between the > > delete and commit would allow the transaction to succeed AFAIK. > > > > (*) - I'm not sure how you'd necessarily give a complete error message if > > the error should really be that an insert was invalid but you noticed it > > on a delete check. > > > > I haven't thought about the update cases at all. > Hm... but, if postgres is required to show the exactly same behaviour, > no matter if constraints are deferred or not, what then is the point of > deferring constraints at all (apart from getting the error at a later > point)? Or did I completly missunderstand you? I think you misunderstood, although I'm not sure where. One point is that while the check portion of the constraint may be deferred the actions are not. In the case of restrict with a deferred constraint, the same logic applies as per the delete cascade above. The action is defined upon the rows that were marked for deletion and their dependent rows, so flagging the fk=2 row on the delete would be invalid because it didn't have a dependent row to be marked for deletion. The reason why this would matter is that you could satisfy the constraint after the delete but before the commit in which case no error occurred. In the immediate case it's hard to come up with a situation where it matters because the invalid row would have already caused an error.
It's a bit more complicated than that as there are also locking issues, like what if other processes insert rows while some others are being deleted, really the whole thing isn't trivial. >> Since postgres already incoporates code to check foreign keys more >> efficiently (when doing alter table ... add constraint .. foreign key, >> postgres seems to use a merge or a hash join, instead of a nested loop), >> I wondered how hard it would be to use this for the triggers too. >> >> I imagined creating a statement-level trigger in parallel to the >> row-level triggers, and defining some threshold (let's say, more than >> 10% of the rows deleted). If the threshold is reached, the row-level >> trigger would just do nothing, and the statement-level trigger would >> delete the referencing records doing a join. >> >> Would this be feasable? And would it be something a newbie could tackle, >> or is it more involved than I think?