Thread: design help for performance
I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.
So... is there a common solution to this problem?
culley
Hi Culley, Have you tried to create fk together with index on fk column on table B? What are results? Would be good if you could send the query and explain analyze... Sent from my Windows Phone ------------------------------ From: Culley Harrelson Sent: 21 December 2011 00:57 To: pgsql-general@postgresql.org Subject: [GENERAL] design help for performance I am bumping into some performance issues and am seeking help. I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution. So... is there a common solution to this problem? culley
Continued top-posting to remain consistent….
It isn’t that the application has outgrown the solution but rather the solution was never correct in the first place. You attempted pre-mature optimization and are getting burned because of it. The reference solution is simply:
SELECT a.*, COUNT(*) AS b_count
FROM a
JOIN b USING (a_id)
GROUP BY a.* {expanded * as needed)
Make sure table b has an index on the a.id column.
This is reference because you never want to introduce computed fields that keep track of other tables WITHOUT some kind of proof that the maintenance nightmare/overhead you are incurring is more than offset by the savings during usage.
Any further optimization requires two things:
Knowledge of the usage patterns of the affected data
Testing to prove that the alternative solutions out-perform the reference solution
Since you already have an existing query you should implement the reference solution above and then test and see whether it performs better or worse than you current solution. If it indeed performs better than move to it; and if it is still not good enough then you need to provide more information about what kinds of queries are hitting A and B as well as Insert/Delete patterns on Table B.
David J.
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Misa Simic
Sent: Tuesday, December 20, 2011 7:13 PM
To: Culley Harrelson; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance
Hi Culley,
Have you tried to create fk together with index on fk column on table B?
What are results? Would be good if you could send the query and explain analyze...
Sent from my Windows Phone
From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance
I am bumping into some performance issues and am seeking help.
I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.
So... is there a common solution to this problem?
culley
Continued top-posting to remain consistent….
It isn’t that the application has outgrown the solution but rather the solution was never correct in the first place. You attempted pre-mature optimization and are getting burned because of it. The reference solution is simply:
SELECT a.*, COUNT(*) AS b_count
FROM a
JOIN b USING (a_id)
GROUP BY a.* {expanded * as needed)
Make sure table b has an index on the a.id column.
This is reference because you never want to introduce computed fields that keep track of other tables WITHOUT some kind of proof that the maintenance nightmare/overhead you are incurring is more than offset by the savings during usage.
Any further optimization requires two things:
Knowledge of the usage patterns of the affected data
Testing to prove that the alternative solutions out-perform the reference solution
Since you already have an existing query you should implement the reference solution above and then test and see whether it performs better or worse than you current solution. If it indeed performs better than move to it; and if it is still not good enough then you need to provide more information about what kinds of queries are hitting A and B as well as Insert/Delete patterns on Table B.
David J.
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Misa Simic
Sent: Tuesday, December 20, 2011 7:13 PM
To: Culley Harrelson; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance
Hi Culley,
Have you tried to create fk together with index on fk column on table B?
What are results? Would be good if you could send the query and explain analyze...
Sent from my Windows PhoneFrom: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performanceI am bumping into some performance issues and am seeking help.
I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.
So... is there a common solution to this problem?
culley
On 21 Dec 2011, at 24:56, Culley Harrelson wrote: > Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the applicationhas grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcountbe updated on table A... The application has outgrown this solution. When you update rowcount_b in table A, that locks the row in A of course, but there's more going on. Because a new versionof that row gets created, the references from B to A also need updating to that new version (creating new versionsof rows in B as well). I think that causes a little bit more locking than originally anticipated - it may even bethe cause of your locking problem. Instead, if you'd create a new table C that only holds the rowcount_b and a reference to A (in a 1:1 relationship), mostof those problems go away. It does add an extra foreign key reference to table A though, which means it will weigh downupdates and deletes there some more. CREATE TABLE C ( table_a_id int PRIMARY KEY REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE CASCADE, table_b_rowcount int NOT NULL DEFAULT 0 ); Yes, those cascades are on purpose - the data in C is useless without the accompanying record in A. Also, the PK makes sureit stays a 1:1 relationship. Alban Hertroys -- Screwing up is an excellent way to attach something to the ceiling.
Have you tried to create fk together with index on fk column on table B?
What are results? Would be good if you could send the query and explain analyze...
Sent from my Windows Phone
From: Culley Harrelson
Sent: 21 December 2011 00:57
To: pgsql-general@postgresql.org
Subject: [GENERAL] design help for performance
I am bumping into some performance issues and am seeking help.
I have two tables A and B in a one (A) to many (B) relationship. There are 1.4 million records in table A and 44 million records in table B. In my web application any request for a record from table A is also going to need a count of associated records in table B. Several years ago I added table_b_rowcount to table A in order to minimize queries on table B. And now, as the application has grown, I am starting to having locking problems on table A. Any change to table B requires the that table_b_rowcount be updated on table A... The application has outgrown this solution.
So... is there a common solution to this problem?
culley
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Alban Hertroys > Sent: Mittwoch, 21. Dezember 2011 08:53 > To: Culley Harrelson > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] design help for performance > > On 21 Dec 2011, at 24:56, Culley Harrelson wrote: > > > Several years ago I added table_b_rowcount to table A in order to > minimize queries on table B. And now, as the application has grown, I > am starting to having locking problems on table A. Any change to table > B requires the that table_b_rowcount be updated on table A... The > application has outgrown this solution. > > > When you update rowcount_b in table A, that locks the row in A of > course, but there's more going on. Because a new version of that row > gets created, the references from B to A also need updating to that new > version (creating new versions of rows in B as well). I think that > causes a little bit more locking than originally anticipated - it may > even be the cause of your locking problem. > > Instead, if you'd create a new table C that only holds the rowcount_b > and a reference to A (in a 1:1 relationship), most of those problems go > away. It does add an extra foreign key reference to table A though, > which means it will weigh down updates and deletes there some more. > > CREATE TABLE C ( > table_a_id int PRIMARY KEY > REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE > CASCADE, > table_b_rowcount int NOT NULL DEFAULT 0 > ); > > Yes, those cascades are on purpose - the data in C is useless without > the accompanying record in A. Also, the PK makes sure it stays a 1:1 > relationship. > > Alban Hertroys Hello, it may help to combine Alban solution with yours but at the cost of a higher complexity: In table C use instead a column table_b_delta_rowcount (+1 /-1 , smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a non unique index on id). Then regularily flush table C content to table A, in order to only have recent changes in C. Your query should then query both tables: SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount)) FROM A LEFT OUTER JOIN B on (A.id=B.id) WHERE A.id = xxx Marc Mamin
I am not going to take it to the delta solution for now.
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Alban Hertroys
> Sent: Mittwoch, 21. Dezember 2011 08:53
> To: Culley Harrelson
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] design help for performance
>Hello,> On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
>
> > Several years ago I added table_b_rowcount to table A in order to
> minimize queries on table B. And now, as the application has grown, I
> am starting to having locking problems on table A. Any change to
table
> B requires the that table_b_rowcount be updated on table A... The
> application has outgrown this solution.
>
>
> When you update rowcount_b in table A, that locks the row in A of
> course, but there's more going on. Because a new version of that row
> gets created, the references from B to A also need updating to that
new
> version (creating new versions of rows in B as well). I think that
> causes a little bit more locking than originally anticipated - it may
> even be the cause of your locking problem.
>
> Instead, if you'd create a new table C that only holds the rowcount_b
> and a reference to A (in a 1:1 relationship), most of those problems
go
> away. It does add an extra foreign key reference to table A though,
> which means it will weigh down updates and deletes there some more.
>
> CREATE TABLE C (
> table_a_id int PRIMARY KEY
> REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> CASCADE,
> table_b_rowcount int NOT NULL DEFAULT 0
> );
>
> Yes, those cascades are on purpose - the data in C is useless without
> the accompanying record in A. Also, the PK makes sure it stays a 1:1
> relationship.
>
> Alban Hertroys
it may help to combine Alban solution with yours but at the cost of a
higher complexity:
In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).
Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should then query both tables:
SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx
Marc Mamin
Hm... I think result on the end will be the same... I am not sure realation produce any locks on parent table... What produces locks is UPDATE, so is it on table A or C should make no difference... If simple join and count fk is so slow - other option would be materialized view... So it would need to include table C as materialized view but on the way to to don't make expensive calculations in real time during insert in B (and locking) There is a article about materialized views on postgres wiki... Sent from my Windows Phone ------------------------------ From: Culley Harrelson Sent: 21 December 2011 22:07 To: Marc Mamin Cc: Alban Hertroys; pgsql-general@postgresql.org Subject: Re: [GENERAL] design help for performance Thank you so much everyone! Introducing table C was indeed my next step but I was unsure if I was going to be just moving the locking problems from A to C. Locking on C is preferable to locking on A but it doesn't really solve the problem. It sounds like I should expect less locking on C because it doesn't relate to B. Thanks again, I am going to give it a try. I am not going to take it to the delta solution for now. On Wed, Dec 21, 2011 at 1:46 AM, Marc Mamin <M.Mamin@intershop.de> wrote: > > > -----Original Message----- > > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > > owner@postgresql.org] On Behalf Of Alban Hertroys > > Sent: Mittwoch, 21. Dezember 2011 08:53 > > To: Culley Harrelson > > Cc: pgsql-general@postgresql.org > > Subject: Re: [GENERAL] design help for performance > > > > On 21 Dec 2011, at 24:56, Culley Harrelson wrote: > > > > > Several years ago I added table_b_rowcount to table A in order to > > minimize queries on table B. And now, as the application has grown, I > > am starting to having locking problems on table A. Any change to > table > > B requires the that table_b_rowcount be updated on table A... The > > application has outgrown this solution. > > > > > > When you update rowcount_b in table A, that locks the row in A of > > course, but there's more going on. Because a new version of that row > > gets created, the references from B to A also need updating to that > new > > version (creating new versions of rows in B as well). I think that > > causes a little bit more locking than originally anticipated - it may > > even be the cause of your locking problem. > > > > Instead, if you'd create a new table C that only holds the rowcount_b > > and a reference to A (in a 1:1 relationship), most of those problems > go > > away. It does add an extra foreign key reference to table A though, > > which means it will weigh down updates and deletes there some more. > > > > CREATE TABLE C ( > > table_a_id int PRIMARY KEY > > REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE > > CASCADE, > > table_b_rowcount int NOT NULL DEFAULT 0 > > ); > > > > Yes, those cascades are on purpose - the data in C is useless without > > the accompanying record in A. Also, the PK makes sure it stays a 1:1 > > relationship. > > > > Alban Hertroys > > Hello, > > it may help to combine Alban solution with yours but at the cost of a > higher complexity: > > In table C use instead a column table_b_delta_rowcount (+1 /-1 , > smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a > non unique index on id). > > Then regularily flush table C content to table A, in order to only have > recent changes in C. > Your query should then query both tables: > > SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount)) > FROM A LEFT OUTER JOIN B on (A.id=B.id) > WHERE A.id = xxx > > Marc Mamin >
I think result on the end will be the same... I am not sure realation produce any locks on parent table...
What produces locks is UPDATE, so is it on table A or C should make no difference...
If simple join and count fk is so slow - other option would be materialized view... So it would need to include table C as materialized view but on the way to to don't make expensive calculations in real time during insert in B (and locking)
There is a article about materialized views on postgres wiki...
Sent from my Windows Phone
From: Culley Harrelson
Sent: 21 December 2011 22:07
To: Marc Mamin
Cc: Alban Hertroys; pgsql-general@postgresql.org
Subject: Re: [GENERAL] design help for performance
Thank you so much everyone! Introducing table C was indeed my next step but I was unsure if I was going to be just moving the locking problems from A to C. Locking on C is preferable to locking on A but it doesn't really solve the problem. It sounds like I should expect less locking on C because it doesn't relate to B. Thanks again, I am going to give it a try.
I am not going to take it to the delta solution for now.
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Alban Hertroys
> Sent: Mittwoch, 21. Dezember 2011 08:53
> To: Culley Harrelson
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] design help for performance
>Hello,> On 21 Dec 2011, at 24:56, Culley Harrelson wrote:
>
> > Several years ago I added table_b_rowcount to table A in order to
> minimize queries on table B. And now, as the application has grown, I
> am starting to having locking problems on table A. Any change to
table
> B requires the that table_b_rowcount be updated on table A... The
> application has outgrown this solution.
>
>
> When you update rowcount_b in table A, that locks the row in A of
> course, but there's more going on. Because a new version of that row
> gets created, the references from B to A also need updating to that
new
> version (creating new versions of rows in B as well). I think that
> causes a little bit more locking than originally anticipated - it may
> even be the cause of your locking problem.
>
> Instead, if you'd create a new table C that only holds the rowcount_b
> and a reference to A (in a 1:1 relationship), most of those problems
go
> away. It does add an extra foreign key reference to table A though,
> which means it will weigh down updates and deletes there some more.
>
> CREATE TABLE C (
> table_a_id int PRIMARY KEY
> REFERENCES table_a (id) ON UPDATE CASCADE ON DELETE
> CASCADE,
> table_b_rowcount int NOT NULL DEFAULT 0
> );
>
> Yes, those cascades are on purpose - the data in C is useless without
> the accompanying record in A. Also, the PK makes sure it stays a 1:1
> relationship.
>
> Alban Hertroys
it may help to combine Alban solution with yours but at the cost of a
higher complexity:
In table C use instead a column table_b_delta_rowcount (+1 /-1 ,
smallint) and only use INSERTs to maintain it, no UPDATEs (hence with a
non unique index on id).
Then regularily flush table C content to table A, in order to only have
recent changes in C.
Your query should then query both tables:
SELECT A. table_b_rowcount + coalesce(sum(C.table_b_delta_rowcount))
FROM A LEFT OUTER JOIN B on (A.id=B.id)
WHERE A.id = xxx
Marc Mamin