Thread: on update / on delete performance of foreign keys

on update / on delete performance of foreign keys

From
"Florian G. Pflug"
Date:
Hi

I ran into some performance problems regarding foreign keys lately.
My schema has about 20 tables, which each contain from 10 to 100.000
records. They have quite complicated interdependencies, modeled using
foregin keys set to "on update cascade, on delete cascade".
The schema stores data for multiple customers - Recently I wanted
to extract the data for just a single customer. I duplicated the schema,
and deleted all but one customer from the "customer" table. This worked
as expected, but the delete took a few hours (!) on a moderatly fast
machine (dual 1GHz PIII, RAID5-Array for postgres-data).

As far as I understand the code, foreign keys are implemented using
triggers, which in case of "on delete cascade" delete from the
"slave"-table when a record from the "master"-table is deleted. I guess
that the bad performance is due to running this trigger about 100.1000
times when deleting a lot of rows from a large table - and each time it
has to find referencing tuples by doing an index scan - so the
performance is about as bad as it would be if "delete * from table"
would use an index scan.

Since postgres already incoporates code to check foreign keys more
efficiently (when doing alter table ... add constraint .. foreign key,
postgres seems to use a merge or a hash join, instead of a nested loop),
I wondered how hard it would be to use this for the triggers too.

I imagined creating a statement-level trigger in parallel to the
row-level triggers, and defining some threshold (let's say, more than
10% of the rows deleted). If the threshold is reached, the row-level
trigger would just do nothing, and the statement-level trigger would
delete the referencing records doing a join.

Would this be feasable? And would it be something a newbie could tackle,
or is it more involved than I think?

greetings, Florian Pflug

Attachment

Re: on update / on delete performance of foreign keys

From
Richard Huxton
Date:
Florian G. Pflug wrote:
> Hi
>
> I ran into some performance problems regarding foreign keys lately.
> My schema has about 20 tables, which each contain from 10 to 100.000
> records. They have quite complicated interdependencies, modeled using
> foregin keys set to "on update cascade, on delete cascade".
> The schema stores data for multiple customers - Recently I wanted
> to extract the data for just a single customer. I duplicated the schema,
> and deleted all but one customer from the "customer" table. This worked
> as expected, but the delete took a few hours (!) on a moderatly fast
> machine (dual 1GHz PIII, RAID5-Array for postgres-data).

PostgreSQL doesn't automatically add indexes to foreign-key columns.
That sounds like the issue to me.

--
   Richard Huxton
   Archonet Ltd

Re: on update / on delete performance of foreign keys

From
Greg Stark
Date:
"Florian G. Pflug" <fgp@phlo.org> writes:

> when deleting a lot of rows from a large table - and each time it has to find
> referencing tuples by doing an index scan

Are you sure it was even an index scan? And not doing a sequential table scan
for every deletion? In order to do an index scan you need an index on foreign
key columns in the referencing tables, not just the referenced tables. Usually
people only think about the referenced tables because usually that's all they
need. Only when you start cascading updates and deletes do you need indexes on
the columns in the referencing tables.

--
greg

Re: on update / on delete performance of foreign keys

From
"Florian G. Pflug"
Date:
Richard Huxton wrote:
> Florian G. Pflug wrote:
>> I ran into some performance problems regarding foreign keys lately.
>> My schema has about 20 tables, which each contain from 10 to 100.000
>> records. They have quite complicated interdependencies, modeled using
>> foregin keys set to "on update cascade, on delete cascade".
>> The schema stores data for multiple customers - Recently I wanted
>> to extract the data for just a single customer. I duplicated the schema,
>> and deleted all but one customer from the "customer" table. This worked
>> as expected, but the delete took a few hours (!) on a moderatly fast
>> machine (dual 1GHz PIII, RAID5-Array for postgres-data).

> PostgreSQL doesn't automatically add indexes to foreign-key columns.
> That sounds like the issue to me.
Oh... *feeling a bit stupid*... Seems that I got confused, because it
requires an index to exist on the referenced table (To speed up inserts,
updates), but not on the referencing table...

Still, I believe that even with an index, the performance will suffer
when deleting a lot of rows from an referenced tabled, because for each
row a trigger has to fire, and do an index scan.
It's entirely possible though, that this is already optimized, and I
just misread the code ;-)

greetings, Florian Pflug

Attachment

Re: on update / on delete performance of foreign keys

From
Stephan Szabo
Date:
On Mon, 24 Jan 2005, Florian G. Pflug wrote:

> Since postgres already incoporates code to check foreign keys more
> efficiently (when doing alter table ... add constraint .. foreign key,
> postgres seems to use a merge or a hash join, instead of a nested loop),
> I wondered how hard it would be to use this for the triggers too.
>
> I imagined creating a statement-level trigger in parallel to the
> row-level triggers, and defining some threshold (let's say, more than
> 10% of the rows deleted). If the threshold is reached, the row-level
> trigger would just do nothing, and the statement-level trigger would
> delete the referencing records doing a join.
>
> Would this be feasable? And would it be something a newbie could tackle,
> or is it more involved than I think?

It's a little more involved.  The first is that I think there's no good
way to tell the row trigger to do nothing (remember that the constraints
may be deferred so simple flags aren't sufficient).  The second is that
these triggers will want to know which rows are deleted, but AFAIK
statement-level triggers don't currently give you that information and
deleting/changing any rows that aren't satisfied does not give the correct
behavior. The no action case is actually a little more involved again as
it needs to remove rows from the set of changed pk rows if new pk rows
have come into existance for matching keys.

Re: on update / on delete performance of foreign keys

From
"Florian G. Pflug"
Date:
Stephan Szabo wrote:
> On Mon, 24 Jan 2005, Florian G. Pflug wrote:
>
>>Since postgres already incoporates code to check foreign keys more
>>efficiently (when doing alter table ... add constraint .. foreign key,
>>postgres seems to use a merge or a hash join, instead of a nested loop),
>>I wondered how hard it would be to use this for the triggers too.
>>
>>I imagined creating a statement-level trigger in parallel to the
>>row-level triggers, and defining some threshold (let's say, more than
>>10% of the rows deleted). If the threshold is reached, the row-level
>>trigger would just do nothing, and the statement-level trigger would
>>delete the referencing records doing a join.
>>
>>Would this be feasable? And would it be something a newbie could tackle,
>>or is it more involved than I think?
>
> It's a little more involved.  The first is that I think there's no good
> way to tell the row trigger to do nothing (remember that the constraints
> may be deferred so simple flags aren't sufficient).
I would be content if my optimization works for the not-deferred case -
I'd don't fully understand how deferred foreign keys are handled in
postgres. (I guess I don't even fully understand their semantics - I
use them only when doing bulk inserts, and there are either circular
dependencies, or I don't feel like find the right table order ;-))

> The second is that
> these triggers will want to know which rows are deleted, but AFAIK
> statement-level triggers don't currently give you that information and
> deleting/changing any rows that aren't satisfied does not give the correct
> behavior.
This I do not understand. Isn't it sufficient to delete any rows whose
reference does not exist (for the on-delete-cascade case), or complain
if such rows exist (for the no-action/restrict case)? The
on-update-cascade case is difficult I guess - I'm not sure if my idea
even works for that case, now that I think about it...

> The no action case is actually a little more involved again as
> it needs to remove rows from the set of changed pk rows if new pk rows
> have come into existance for matching keys.
Guess without understanding your previous comment I'm lost here too - I
wouldn't care to check only changed rows - I would check them all - but
only if some estimate shows that it will probably cheaper.

At the moment I'm writing a few plpgsql functions that do what I want.
They disable all constraint-related trigger, do a deleted, and then
recursivly traverse all tables (following the foreign-keys), and do
a "delete from .. where not exists (select 1 from ... where ...)".
I'll if I stumble upon problems - maybe I'll suddenly understand your
comments ;-)))

greetings, Florian Pflug

Attachment

Re: on update / on delete performance of foreign keys

From
Stephan Szabo
Date:
On Mon, 24 Jan 2005, Florian G. Pflug wrote:

> Stephan Szabo wrote:
> > On Mon, 24 Jan 2005, Florian G. Pflug wrote:
> >
> >>Since postgres already incoporates code to check foreign keys more
> >>efficiently (when doing alter table ... add constraint .. foreign key,
> >>postgres seems to use a merge or a hash join, instead of a nested loop),
> >>I wondered how hard it would be to use this for the triggers too.
> >>
> >>I imagined creating a statement-level trigger in parallel to the
> >>row-level triggers, and defining some threshold (let's say, more than
> >>10% of the rows deleted). If the threshold is reached, the row-level
> >>trigger would just do nothing, and the statement-level trigger would
> >>delete the referencing records doing a join.
> >>
> >>Would this be feasable? And would it be something a newbie could tackle,
> >>or is it more involved than I think?
> >
> > It's a little more involved.  The first is that I think there's no good
> > way to tell the row trigger to do nothing (remember that the constraints
> > may be deferred so simple flags aren't sufficient).
> I would be content if my optimization works for the not-deferred case -
> I'd don't fully understand how deferred foreign keys are handled in
> postgres. (I guess I don't even fully understand their semantics - I
> use them only when doing bulk inserts, and there are either circular
> dependencies, or I don't feel like find the right table order ;-))

Actually, thinking about it, this might not be so bad, you'd need to
effectively have a stack of states (for events triggered by triggered
actions perhaps many levels deep) but I think that's already effectively
there.

> > > The second is that
> > these triggers will want to know which rows are deleted, but AFAIK
> > statement-level triggers don't currently give you that information and
> > deleting/changing any rows that aren't satisfied does not give the correct
> > behavior.
> This I do not understand. Isn't it sufficient to delete any rows whose
> reference does not exist (for the on-delete-cascade case), or complain
> if such rows exist (for the no-action/restrict case)? The
> on-update-cascade case is difficult I guess - I'm not sure if my idea
> even works for that case, now that I think about it...

It's not sufficient to do the delete for non existant pk rows in the
deferred case.  I also think we'd need to decide on the behavior for the
PostgreSQL case where a user trigger runs in between the delete and the
action (for example, if I delete where pk=1 and then in between the delete
and its action insert a row with pk=1 does the delete fire? The spec
doesn't say much because I don't think you can run anything between the
two.)

insert into pk values (1);
begin;
 insert into fk values (2);
 delete from pk;
commit;

AFAICT to follow the foreign key semantics if the foreign key check is
deferred an error occurs on commit.  Deleting the fk row on the delete
from pk is not allowed.

I think it may be valid for on delete no action even in the deferred
case(*) , but I haven't done alot of thinking about it, but I think it's
also invalid for deferred restrict since only the rows being deleted have
the restrict applied to them, so an insert into pk values (2) between the
delete and commit would allow the transaction to succeed AFAIK.

(*) - I'm not sure how you'd necessarily give a complete error message if
the error should really be that an insert was invalid but you noticed it
on a delete check.

I haven't thought about the update cases at all.

> > The no action case is actually a little more involved again as
> > it needs to remove rows from the set of changed pk rows if new pk rows
> > have come into existance for matching keys.
> Guess without understanding your previous comment I'm lost here too - I
> wouldn't care to check only changed rows - I would check them all - but
> only if some estimate shows that it will probably cheaper.

> At the moment I'm writing a few plpgsql functions that do what I want.
> They disable all constraint-related trigger, do a deleted, and then
> recursivly traverse all tables (following the foreign-keys), and do
> a "delete from .. where not exists (select 1 from ... where ...)".
> I'll if I stumble upon problems - maybe I'll suddenly understand your
> comments ;-)))

Note that this may fail for cases where the original delete or any of the
triggered deletes run statements that should cause triggered actions.

Re: on update / on delete performance of foreign keys

From
Harald Fuchs
Date:
In article <41F5088D.8060702@phlo.org>,
"Florian G. Pflug" <fgp@phlo.org> writes:

>> PostgreSQL doesn't automatically add indexes to foreign-key
>> columns. That sounds like the issue to me.

> Oh... *feeling a bit stupid*... Seems that I got confused, because it
> requires an index to exist on the referenced table (To speed up inserts,
> updates), but not on the referencing table...

That's not quite right.  What PostgreSQL (or any other DBMS) requires
from the referenced table is a UNIQUE constraint on the column in
question so that the referencing table points to a single row, and
UNIQUE constraints are usually implemented by indices.

Re: on update / on delete performance of foreign keys

From
Tom Lane
Date:
Harald Fuchs <hf0722x@protecting.net> writes:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>>> PostgreSQL doesn't automatically add indexes to foreign-key
>>> columns. That sounds like the issue to me.

>> Oh... *feeling a bit stupid*... Seems that I got confused, because it
>> requires an index to exist on the referenced table (To speed up inserts,
>> updates), but not on the referencing table...

> That's not quite right.  What PostgreSQL (or any other DBMS) requires
> from the referenced table is a UNIQUE constraint on the column in
> question so that the referencing table points to a single row, and
> UNIQUE constraints are usually implemented by indices.

You missed the point.  PG does force you to put an index on the
referenced column, but it does not force you to put one on the
referencing column.  However, deletions in the referenced table are
going to be really slow if there is no index on the referencing column,
because there's no fast way to look up referencing rows matching the
key to be deleted.

            regards, tom lane

Re: on update / on delete performance of foreign keys

From
"Florian G. Pflug"
Date:
Stephan Szabo wrote:

>>>The second is that
>>>these triggers will want to know which rows are deleted, but AFAIK
>>>statement-level triggers don't currently give you that information and
>>>deleting/changing any rows that aren't satisfied does not give the correct
>>>behavior.
>>
>>This I do not understand. Isn't it sufficient to delete any rows whose
>>reference does not exist (for the on-delete-cascade case), or complain
>>if such rows exist (for the no-action/restrict case)? The
>>on-update-cascade case is difficult I guess - I'm not sure if my idea
>>even works for that case, now that I think about it...
>
> It's not sufficient to do the delete for non existant pk rows in the
> deferred case.  I also think we'd need to decide on the behavior for the
> PostgreSQL case where a user trigger runs in between the delete and the
> action (for example, if I delete where pk=1 and then in between the delete
> and its action insert a row with pk=1 does the delete fire? The spec
> doesn't say much because I don't think you can run anything between the
> two.)
>
> insert into pk values (1);
> begin;
>  insert into fk values (2);
>  delete from pk;
> commit;
>
> AFAICT to follow the foreign key semantics if the foreign key check is
> deferred an error occurs on commit.  Deleting the fk row on the delete
> from pk is not allowed.
So, does that mean that on-delete-cascade effectivly doesn't cascade if
deferred? Or only to rows created _before_ the delete? (Altough this is
not what your example shows)

> I think it may be valid for on delete no action even in the deferred
> case(*) , but I haven't done alot of thinking about it, but I think it's
> also invalid for deferred restrict since only the rows being deleted have
> the restrict applied to them, so an insert into pk values (2) between the
> delete and commit would allow the transaction to succeed AFAIK.
>
> (*) - I'm not sure how you'd necessarily give a complete error message if
> the error should really be that an insert was invalid but you noticed it
> on a delete check.
>
> I haven't thought about the update cases at all.
Hm... but, if postgres is required to show the exactly same behaviour,
no matter if constraints are deferred or not, what then is the point of
deferring constraints at all (apart from getting the error at a later
point)? Or did I completly missunderstand you?

Is the SQL-Standard online somewhere? I'd like to read what it has
to say on deferred triggers - I'm still confused about their semantics
(I couldn't even say what semantics they should have ;-) ).

greetings, Florian Pflug
PS: And thanks for your patience while explaining this stuff to me.

Attachment

Re: on update / on delete performance of foreign keys

From
Stephan Szabo
Date:
On Tue, 25 Jan 2005, Florian G. Pflug wrote:

> Stephan Szabo wrote:
>
> > It's not sufficient to do the delete for non existant pk rows in the
> > deferred case.  I also think we'd need to decide on the behavior for the
> > PostgreSQL case where a user trigger runs in between the delete and the
> > action (for example, if I delete where pk=1 and then in between the delete
> > and its action insert a row with pk=1 does the delete fire? The spec
> > doesn't say much because I don't think you can run anything between the
> > two.)
> >
> > insert into pk values (1);
> > begin;
> >  insert into fk values (2);
> >  delete from pk;
> > commit;
> >
> > AFAICT to follow the foreign key semantics if the foreign key check is
> > deferred an error occurs on commit.  Deleting the fk row on the delete
> > from pk is not allowed.
> So, does that mean that on-delete-cascade effectivly doesn't cascade if
> deferred? Or only to rows created _before_ the delete? (Altough this is
> not what your example shows)

No, just that a delete cascade only deletes rows that are dependent upon
rows marked for deletion in the referenced table.  If there were an fk row
with 1 in the above as well, that row would be removed.

> > I think it may be valid for on delete no action even in the deferred
> > case(*) , but I haven't done alot of thinking about it, but I think it's
> > also invalid for deferred restrict since only the rows being deleted have
> > the restrict applied to them, so an insert into pk values (2) between the
> > delete and commit would allow the transaction to succeed AFAIK.
> >
> > (*) - I'm not sure how you'd necessarily give a complete error message if
> > the error should really be that an insert was invalid but you noticed it
> > on a delete check.
> >
> > I haven't thought about the update cases at all.
> Hm... but, if postgres is required to show the exactly same behaviour,
> no matter if constraints are deferred or not, what then is the point of
> deferring constraints at all (apart from getting the error at a later
> point)? Or did I completly missunderstand you?

I think you misunderstood, although I'm not sure where. One point is that
while the check portion of the constraint may be deferred the actions are
not.

In the case of restrict with a deferred constraint, the same logic applies
as per the delete cascade above.  The action is defined upon the rows that
were marked for deletion and their dependent rows, so flagging the fk=2
row on the delete would be invalid because it didn't have a dependent row
to be marked for deletion.  The reason why this would matter is that you
could satisfy the constraint after the delete but before the commit in
which case no error occurred. In the immediate case it's hard to come up
with a situation where it matters because the invalid row would have
already caused an error.

Re: on update / on delete performance of foreign keys

From
PFC
Date:
    It's a bit more complicated than that as there are also locking issues,
like what if other processes insert rows while some others are being
deleted, really the whole thing isn't trivial.


>> Since postgres already incoporates code to check foreign keys more
>> efficiently (when doing alter table ... add constraint .. foreign key,
>> postgres seems to use a merge or a hash join, instead of a nested loop),
>> I wondered how hard it would be to use this for the triggers too.
>>
>> I imagined creating a statement-level trigger in parallel to the
>> row-level triggers, and defining some threshold (let's say, more than
>> 10% of the rows deleted). If the threshold is reached, the row-level
>> trigger would just do nothing, and the statement-level trigger would
>> delete the referencing records doing a join.
>>
>> Would this be feasable? And would it be something a newbie could tackle,
>> or is it more involved than I think?