Thread: Derived columns / denormalization
Hi,

I (think I) want to denormalize an aggregate for performance reasons. I think
that is the right terminology, please correct me if not. Consider these two
tables:

items:
  id serial (pk),
  group_id int (fk),
  ...

groups:
  id serial (pk),
  item_count int -- this is derived from (select count(*) from items where group_id = id)
  ...

item_count would be updated by insert/update/delete triggers on the items
table; hopefully that would ensure it is always correct?

I'm wondering: is what I'm trying to do here pretty standard, and are there
any gotchas I should be aware of?

Lastly (assuming this is acceptable practice), are there any shorthand ways
of setting these kinds of derived columns up? It seems like there's a lot of
trigger/function writing that could be automatically generated for a lot of
cases.

Thanks!
Jamie
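The trigger setup described above might look something like this in PL/pgSQL
(a sketch, not code from the thread; function and trigger names are
illustrative, and it assumes the items/groups definitions given earlier):

```sql
-- Illustrative sketch: keep groups.item_count in sync with items.
CREATE OR REPLACE FUNCTION items_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE groups SET item_count = item_count + 1 WHERE id = NEW.group_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE groups SET item_count = item_count - 1 WHERE id = OLD.group_id;
    ELSIF TG_OP = 'UPDATE' AND NEW.group_id IS DISTINCT FROM OLD.group_id THEN
        UPDATE groups SET item_count = item_count - 1 WHERE id = OLD.group_id;
        UPDATE groups SET item_count = item_count + 1 WHERE id = NEW.group_id;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_count
    AFTER INSERT OR UPDATE OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE items_count_trig();
```

Note that this maintains the count incrementally rather than re-running the
full count(*) on every change; the concurrency caveats raised in the replies
below still apply.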
"Jamie Tufnell" <diesql@googlemail.com> writes:
> item_count int -- this is derived from (select count(*) from items
> where group_id = id)
> ...

> item_count would be updated by insert/update/delete triggers on the
> items table, hopefully that would ensure it is always correct?

> I'm wondering is what I'm trying to do here pretty standard and are
> there any gotchas I should be aware of?

Concurrent updates to the items table make this much harder than it might
first appear.  If you're willing to serialize all your updating transactions
then you can make it work, but ...

			regards, tom lane
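One way to get the serialization Tom describes is to lock the parent groups
row before touching items, so that all writers for a given group queue up
behind one another (a sketch, not code from the thread; the id value 42 is
illustrative):

```sql
-- Illustrative sketch: serialize all updates touching one group.
BEGIN;
SELECT 1 FROM groups WHERE id = 42 FOR UPDATE;  -- take the row lock up front
INSERT INTO items (group_id) VALUES (42);       -- trigger adjusts item_count
COMMIT;
```

Every updating transaction has to follow the same convention for this to
work, which is the cost Tom is alluding to: concurrent writers to the same
group block each other for the duration of the transaction.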
On Jan 15, 2009, at 8:06 PM, Tom Lane wrote:
> "Jamie Tufnell" <diesql@googlemail.com> writes:
>> item_count int -- this is derived from (select count(*) from items
>> where group_id = id)
>> ...
>
>> item_count would be updated by insert/update/delete triggers on the
>> items table, hopefully that would ensure it is always correct?
>
>> I'm wondering is what I'm trying to do here pretty standard and are
>> there any gotchas I should be aware of?
>
> Concurrent updates to the items table make this much harder than
> it might first appear.  If you're willing to serialize all your
> updating transactions then you can make it work, but ...

That was exactly the caveat I was about to point out.  That being said,
keeping COUNT() values and other computed statistics based on other data in
the database *is* a fairly common tactic.  One method that I've used to
great success to avoid the serialization problem is to have your triggers
insert the necessary information for the update into a separate "update
queue" table.  You then have a separate process that routinely sweeps that
update queue, aggregates the updates, and then updates your count values in
the groups table with the total update values for each groups entry that has
pending updates.

Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k
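Erik's update-queue approach could be sketched roughly like this (all names
are illustrative and this is not code from the thread; the triggers only
append, so concurrent writers never contend on the groups row):

```sql
-- Illustrative sketch of the "update queue" approach.
CREATE TABLE item_count_queue (
    id       serial PRIMARY KEY,
    group_id int NOT NULL,
    delta    int NOT NULL  -- +1 for an insert, -1 for a delete
);

CREATE OR REPLACE FUNCTION queue_count_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO item_count_queue (group_id, delta) VALUES (NEW.group_id, +1);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO item_count_queue (group_id, delta) VALUES (OLD.group_id, -1);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_queue_count
    AFTER INSERT OR DELETE ON items
    FOR EACH ROW EXECUTE PROCEDURE queue_count_change();

-- Periodic sweep, run by the separate process.  It snapshots a batch first
-- so that rows queued while the sweep runs aren't deleted uncounted.
BEGIN;
CREATE TEMP TABLE swept ON COMMIT DROP AS
    SELECT id, group_id, delta FROM item_count_queue;
UPDATE groups g
   SET item_count = g.item_count + q.total
  FROM (SELECT group_id, sum(delta) AS total
          FROM swept GROUP BY group_id) q
 WHERE g.id = q.group_id;
DELETE FROM item_count_queue WHERE id IN (SELECT id FROM swept);
COMMIT;
```

The trade-off is that item_count lags reality by up to one sweep interval,
which is often acceptable for a mostly-read workload like the one described
in this thread.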
On 1/17/09, Erik Jones <ejones@engineyard.com> wrote:
> On Jan 15, 2009, at 8:06 PM, Tom Lane wrote:
>> Concurrent updates to the items table make this much harder than
>> it might first appear.  If you're willing to serialize all your
>> updating transactions then you can make it work, but ...
>
> That was exactly the caveat I was about to point out.  That being
> said, keeping COUNT() values and other computed statistics based on
> other data in the database *is* a fairly common tactic.  One method
> that I've used to great success to avoid the serialization problem is
> to have your triggers insert the necessary information for the update
> into a separate "update queue" table.  You then have a separate
> process that routinely sweeps that update queue, aggregates the
> updates, and then updates your count values in the groups table.

Fortunately our items table rarely sees concurrent writes.  It's over 99%
reads and is typically updated by just one user.  We are already caching
these aggregates and other data in a separate layer, and my goal is to see
if I can get rid of that layer.  In light of your advice, though, I will
think things through a bit more first.

Thanks for your help!
Jamie