Derived columns / denormalization

From: "Jamie Tufnell"
Hi,

I (think I) want to denormalize an aggregate for performance reasons.
I think that is the right terminology, please correct me if not.
Consider these two tables:

items:
  id serial (pk),
  group_id int (fk),
  ...

groups:
  id serial (pk),
  item_count int  -- this is derived from (select count(*) from items
                  -- where group_id = id)
  ...

item_count would be updated by insert/update/delete triggers on the
items table, hopefully that would ensure it is always correct?

I'm wondering: is what I'm trying to do here pretty standard, and are
there any gotchas I should be aware of?
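
For concreteness, I imagine the triggers looking something like this
(a sketch only -- the function and trigger names are made up, and the
exact syntax may be off):

```sql
-- Maintain groups.item_count by applying a +1/-1 delta rather than
-- recomputing count(*), so each statement only touches the affected
-- groups row(s).
CREATE OR REPLACE FUNCTION maintain_item_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        UPDATE groups SET item_count = item_count + 1
         WHERE id = NEW.group_id;
    END IF;
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        UPDATE groups SET item_count = item_count - 1
         WHERE id = OLD.group_id;
    END IF;
    RETURN NULL;  -- AFTER trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_item_count_trg
    AFTER INSERT OR UPDATE OF group_id OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE maintain_item_count();
```

(For an UPDATE that leaves group_id unchanged, the +1 and -1 hit the
same row and cancel out, so the count stays right either way.)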

Lastly (assuming this is acceptable practice), are there any
shorthand ways of setting these kinds of derived columns up?  It seems
like there's a lot of trigger/function writing that could be
automatically generated for many common cases.

Thanks!
Jamie


Re: Derived columns / denormalization

From: Tom Lane
"Jamie Tufnell" <diesql@googlemail.com> writes:
>  item_count int -- this is derived from (select count(*) from items
> where group_id = id)
>  ...

> item_count would be updated by insert/update/delete triggers on the
> items table, hopefully that would ensure it is always correct?

> I'm wondering is what I'm trying to do here pretty standard and are
> there any gotchas I should be aware of?

Concurrent updates to the items table make this much harder than
it might first appear.  If you're willing to serialize all your updating
transactions then you can make it work, but ...
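One way the race can play out, if the trigger recomputes count(*)
rather than applying a delta (hypothetical session trace, READ
COMMITTED):

```sql
-- n = current number of rows in group 1.
-- T1: BEGIN; INSERT INTO items (group_id) VALUES (1);
-- T2: BEGIN; INSERT INTO items (group_id) VALUES (1);
-- T1 trigger: SELECT count(*) FROM items WHERE group_id = 1;
--             -- sees only its own insert: n+1
-- T2 trigger: SELECT count(*) FROM items WHERE group_id = 1;
--             -- likewise sees only its own insert: n+1
-- Both: UPDATE groups SET item_count = <n+1> WHERE id = 1;
--   (T2 blocks on T1's row lock, then overwrites with the same
--    stale value)
-- T1: COMMIT;  T2: COMMIT;
-- Result: group 1 now has n+2 rows, but item_count = n+1.
```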
        regards, tom lane


Re: Derived columns / denormalization

From: Erik Jones
On Jan 15, 2009, at 8:06 PM, Tom Lane wrote:

> "Jamie Tufnell" <diesql@googlemail.com> writes:
>> item_count int -- this is derived from (select count(*) from items
>> where group_id = id)
>> ...
>
>> item_count would be updated by insert/update/delete triggers on the
>> items table, hopefully that would ensure it is always correct?
>
>> I'm wondering is what I'm trying to do here pretty standard and are
>> there any gotchas I should be aware of?
>
> Concurrent updates to the items table make this much harder than
> it might first appear.  If you're willing to serialize all your
> updating transactions then you can make it work, but ...

That was exactly the caveat I was about to point out.  That being
said, keeping COUNT() values and other computed statistics based on
other data in the database *is* a fairly common tactic.  One method
that I've used to great success to avoid the serialization problem is
to have your triggers insert the necessary information for the update
into a separate "update queue" table.  You then have a separate
process that routinely sweeps that queue, aggregates the pending
updates, and applies the totals to the count values in the groups
table for each affected groups entry.
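
Roughly, the moving parts look like this (a sketch only -- all of the
names here are made up, and you'd want to adapt the details):

```sql
-- Writers append +1/-1 deltas here instead of touching groups,
-- so they never contend on the same groups row.
CREATE TABLE item_count_queue (
    id       bigserial PRIMARY KEY,
    group_id int NOT NULL,
    delta    int NOT NULL
);

CREATE OR REPLACE FUNCTION queue_item_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        INSERT INTO item_count_queue (group_id, delta)
             VALUES (NEW.group_id, 1);
    END IF;
    IF TG_OP IN ('DELETE', 'UPDATE') THEN
        INSERT INTO item_count_queue (group_id, delta)
             VALUES (OLD.group_id, -1);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_queue_trg
    AFTER INSERT OR UPDATE OF group_id OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE queue_item_count();

-- The sweeper, run periodically by a background job.  Bounding the
-- batch by max(id) leaves rows queued mid-sweep for the next pass.
CREATE OR REPLACE FUNCTION sweep_item_counts() RETURNS void AS $$
DECLARE
    boundary bigint;
BEGIN
    SELECT coalesce(max(id), 0) INTO boundary FROM item_count_queue;
    UPDATE groups g
       SET item_count = item_count + q.total
      FROM (SELECT group_id, sum(delta) AS total
              FROM item_count_queue
             WHERE id <= boundary
             GROUP BY group_id) q
     WHERE g.id = q.group_id;
    DELETE FROM item_count_queue WHERE id <= boundary;
END;
$$ LANGUAGE plpgsql;
```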

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k

Re: Derived columns / denormalization

From: "Jamie Tufnell"
On 1/17/09, Erik Jones <ejones@engineyard.com> wrote:
> On Jan 15, 2009, at 8:06 PM, Tom Lane wrote:
>> "Jamie Tufnell" <diesql@googlemail.com> writes:
>>> item_count int -- this is derived from (select count(*) from items
>>> where group_id = id)
>>> ...
>>
>>> item_count would be updated by insert/update/delete triggers on the
>>> items table, hopefully that would ensure it is always correct?
>>
>> Concurrent updates to the items table make this much harder than
>> it might first appear.  If you're willing to serialize all your
>> updating transactions then you can make it work, but ...
>
> That was exactly the caveat I was about to point out.  That being
> said, keeping COUNT() values and other computed statistics based on
> other data in the database *is* a fairly common tactic.  One method
> that I've used to great success to avoid the serialization problem is
> to have your triggers insert the necessary information for the update
> into a separate "update queue" table.  You then have a separate
> process that routinely sweeps that queue, aggregates the pending
> updates, and applies the totals to the count values in the groups
> table for each affected groups entry.

Fortunately our items table rarely sees concurrent writes.  It's over
99% reads and is typically updated by just one user.  We are already
caching these aggregates and other data in a separate layer and my
goal is to see if I can get rid of that layer.

In light of your advice though, I will think things through a bit more first.

Thanks for your help!
Jamie