Thread: Questions about horizontal partitioning
Now, it turns out that 99% of the time only about 8 columns are required, but all 230 columns are populated. However, legacy applications (which run nightly for batch processing, but not during the day, when heavy volume occurs) require the old table design. New applications select only a limited number of columns, but require the original column names.
I want to do the following:
1. Split the table into two tables (one with 6 columns, the other with 224 columns), using the primary key to establish a 1-1 relationship between them.
2. Create a view that displays the join between the two tables.
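As a sketch, the split might look like the following (all table and column names here are hypothetical stand-ins, since the real schema isn't shown):

```sql
-- Hypothetical split of the original 230-column table.
CREATE TABLE orders_narrow (
    order_id  integer PRIMARY KEY,
    col1      text,
    col2      text,
    col3      text,
    col4      text,
    col5      text
);

CREATE TABLE orders_wide (
    order_id  integer PRIMARY KEY
              REFERENCES orders_narrow (order_id),
    -- ... the remaining ~224 columns ...
    colN      text
);

-- View that presents the original layout to legacy applications.
CREATE VIEW orders AS
    SELECT n.order_id, n.col1, n.col2, n.col3, n.col4, n.col5,
           w.colN   -- ... and the other wide columns ...
    FROM orders_narrow n
    JOIN orders_wide w USING (order_id);
```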
By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table), PostgreSQL won't have to load the wider table into the buffer pool, and will thereby access only about 10% of the disk that it presently does.
Is this a sound theory? Is the query optimizer smart enough to ignore part of a join when that part of the join will have no effect on the final result set?
thanks
John Sales <spelunker334@yahoo.com> writes:
> By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table), PostgreSQL won't have to load the wider table into the buffer pool, and will thereby access only about 10% of the disk that it presently does.

> Is this a sound theory?

No.  It still has to touch the second table to confirm the existence of rows to join to.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
>> By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table), PostgreSQL won't have to load the wider table into the buffer pool, and will thereby access only about 10% of the disk that it presently does.

>> Is this a sound theory?

> No.  It still has to touch the second table to confirm the existence of rows to join to.

Good point.
What if I made the second table optional using an outer join, so as not to imply the existence of rows in the "wide" table?
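Something like this, with hypothetical names for the two halves of the split:

```sql
-- Outer-join version of the view: every narrow-table row appears
-- in the result even if no matching wide-table row exists.
CREATE VIEW orders AS
    SELECT n.order_id, n.col1, n.col2, n.col3, n.col4, n.col5,
           w.colN   -- ... and the other wide columns ...
    FROM orders_narrow n
    LEFT JOIN orders_wide w USING (order_id);
```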
thanks
On 01/08/07 20:39, Tom Lane wrote:
> John Sales <spelunker334@yahoo.com> writes:
>> By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table), PostgreSQL won't have to load the wider table into the buffer pool, and will thereby access only about 10% of the disk that it presently does.

>> Is this a sound theory?

> No.  It still has to touch the second table to confirm the existence of rows to join to.

But if a query /requests *only* the six columns (that are in the narrower table)/, why will the optimizer care about the other 224 columns?
Ron Johnson wrote:
> But if a query /requests *only* the six columns (that are in the narrower table)/, why will the optimizer care about the other 224 columns?

It would. A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.
--
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC 27560
Phone: 877-258-8987/919-463-0999
http://www.otg-nc.com
Expert PostgreSQL Training: http://www.otg-nc.com/training-courses/category.php?cat_id=8
On 01/09/07 07:28, Chander Ganesan wrote:
> Ron Johnson wrote:
>> But if a query /requests *only* the six columns (that are in the narrower table)/, why will the optimizer care about the other 224 columns?

> It would. A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.

Sure, if you were selecting those 6 columns from the "inner join view".

<pause>

Ah, now that I reread the OP, I see that that's what he seems to mean.
On 09/01/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
> But if a query /requests *only* the six columns (that are in the narrower table)/, why will the optimizer care about the other 224 columns?

If you are doing an inner join (read: a normal join), the row has to exist in both tables to be in the final result. If you're doing an outer join, it depends upon its type (left, right or full), and then Postgres may not optimise it out.

Peter Childs
Ron Johnson wrote:
> Sure, if you were selecting those 6 columns from the "inner join view".

In theory, if the table with 6 columns was the child of the table with 200+ columns, and a PK-FK relationship existed, then the optimizer wouldn't need to check for the existence of the rows, since the PK-FK relationship would indicate that the rows do exist. However, I don't *think* that the optimizer takes that into account (though with PostgreSQL you never know, it's pretty smart stuff ;-) ). Of course, that might make insert operations difficult, but if you are using a view to perform inserts you could probably handle that fairly easily in the ON INSERT rule...
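A sketch of that parent-child arrangement (hypothetical names; the narrow table is the child, so every narrow row is guaranteed a matching wide row):

```sql
-- Parent: the wide table holds the primary key.
CREATE TABLE orders_wide (
    order_id integer PRIMARY KEY
    -- ... ~224 more columns ...
);

-- Child: the narrow table's key references the parent, so the
-- FK constraint guarantees a matching wide row exists.
CREATE TABLE orders_narrow (
    order_id integer PRIMARY KEY
             REFERENCES orders_wide (order_id),
    col1 text,
    col2 text,
    col3 text,
    col4 text,
    col5 text
);
```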
Chander Ganesan <chander@otg-nc.com> writes:
> In theory, if the table with 6 columns was the child of the table with 200+ columns, and a PK-FK relationship existed, then the optimizer wouldn't need to check for the existence of the rows, since the PK-FK relationship would indicate that the rows do exist.

No, that's still not right.  With a LEFT JOIN you know that each row of the narrow table will produce at least one row in the join view.  What you don't know is whether the row could produce more than one join row --- ie, is there more than one wide-table row that joins to it?

To optimize away the join, the planner would have to find a unique constraint on the wide table's join column(s).  This is certainly doable in principle, though I find it questionable whether the planner should spend cycles on every join query checking for something that won't be true in the vast majority of real-world queries.  The main reason we have not considered it to date is that the correctness of the plan would then depend on a constraint that could get dropped --- but the plan would not fail when run, as indeed it wouldn't be touching that table at all.  We really need some plan-invalidation infrastructure to force re-planning whenever a table's schema changes, and only then will it be safe for the planner to start depending on constraints for correctness-related decisions.  (This is why constraint exclusion is off by default at the moment.)

I've been wanting plan invalidation for a long time ... maybe it will finally happen in 8.3.

			regards, tom lane
On Tue, Jan 09, 2007 at 08:28:29 -0500, Chander Ganesan <chander@otg-nc.com> wrote:
> It would. A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.

While this might not be worth doing, a foreign key constraint could be used to establish the existence of the matching row, so that you wouldn't actually need to look in the other table if you didn't need any of its other columns. There will be some tricky cases, such as deferred constraints, or updates to the foreign key field.
On Tue, Jan 09, 2007 at 10:33:52 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, that's still not right.  With a LEFT JOIN you know that each row of the narrow table will produce at least one row in the join view.  What you don't know is whether the row could produce more than one join row --- ie, is there more than one wide-table row that joins to it?

Thanks for pointing that out. I only thought of half of the problem.

> To optimize away the join, the planner would have to find a unique constraint on the wide table's join column(s).  This is certainly doable in principle, though I find it questionable whether the planner should spend cycles on every join query checking for something that won't be true in the vast majority of real-world queries.

In this case the test would only be applied when no columns were being used from the table being joined to. Since that is also an unusual case, if that case could be quickly checked for, then it might conceivably be worth doing the more expensive test for the proper not-null foreign key relation and unique constraint.
Bruno Wolff III <bruno@wolff.to> writes:
> In this case the test would only be applied when no columns were being used from the table being joined to.  Since that is also an unusual case, if that case could be quickly checked for, then it might conceivably be worth doing the more expensive test for the proper not-null foreign key relation and unique constraint.

Yeah.  Right offhand I think that would be fairly expensive too :-(.  The problem is that the query *DOES* have a reference to at least one column of the wide table, namely its join key.  So you'd have to distinguish whether any references appear "above" the join.  In the current planner structure I think that this information is computed, but not until it's far too late to be removing joins from the tree.

			regards, tom lane
Chander Ganesan wrote:
> Ron Johnson wrote:
>> But if a query /requests *only* the six columns (that are in the narrower table)/, why will the optimizer care about the other 224 columns?

> It would. A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.
However, if the primary key is entirely within those six columns, there will have to be an index on it in both tables to enforce the primary key constraint. In that case, an inner join could be performed with an index lookup or an index scan plus hash join, for a query that didn’t use any other columns. Whether that translates into a significant I/O reduction depends on how wide and how frequently non-NULL those other columns are.
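To see whether the planner actually takes that route, one could inspect the plan for a narrow query against the split tables (hypothetical names; no particular plan output is assumed here):

```sql
-- Both halves are keyed on order_id, so an inner join for a
-- query touching only the narrow columns can use the primary
-- key indexes on each side.
EXPLAIN
SELECT n.order_id, n.col1
FROM orders_narrow n
JOIN orders_wide w USING (order_id)
WHERE n.order_id = 42;
```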
> However, if the primary key is entirely within those six columns, there will have to be an index on it in both tables to enforce the primary key constraint. In that case, an inner join could be performed with an index lookup or an index scan plus hash join, for a query that didn't use any other columns. Whether that translates into a significant I/O reduction depends on how wide and how frequently non-NULL those other columns are.

... if someone is feeling pedagogical (and the answer isn't that complicated), could they explain why a simple index on the desired columns wouldn't be the best solution?

Cheers

Antoine