Thread: Questions about horizontal partitioning

Questions about horizontal partitioning

From
John Sales
Date:
Suppose I have a table - lets say it was developed by someone with little or no understanding of database design - and it has 230 columns.

Now, it turns out that 99% of the time only about 8 colums are required, but all 230 columns are populated.  However, legacy applications (which are run nightly for batch processing, but not during the day, when heavy volume occurs) require the old table design.  New applications only select a limited number of columns, but require the old design names.

I want to do the following:

1.  Split the table into two tables (one with 6 columns, the other with 224 columns), using the primary key to establish a 1-1 relationship between them.
2.  Create a view that displays the join between the two tables.

By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table) that PostgreSQL won't have to load the wider table into the buffer pool, and thereby actually have to only access about 10% the amount of disk that it presently does.

Is this a sound theory?  Is the query optimizer smart enough ignore part of a join when a portion of the join will have no effect on the end result set?

thanks

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

Re: Questions about horizontal partitioning

From
Tom Lane
Date:
John Sales <spelunker334@yahoo.com> writes:
> By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only
thesix columns (that are in the narrower table) that PostgreSQL won't have to load the wider table into the buffer
pool,and thereby actually have to only access about 10% the amount of disk that it presently does. 

> Is this a sound theory?

No.  It still has to touch the second table to confirm the existence of
rows to join to.

            regards, tom lane

Re: Questions about horizontal partitioning

From
John Sales
Date:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
John Sales writes:
> By doing this, I'm hoping that the query optimizer is smart enough to see that if a query comes in and requests only the six columns (that are in the narrower table) that PostgreSQL won't have to load the wider table into the buffer pool, and thereby actually have to only access about 10% the amount of disk that it presently does.

> Is this a sound theory?

No. It still has to touch the second table to confirm the existence of
rows to join to.

regards, tom lane
Good point..

What if I made the second table optional using an outer join, so as not to imply the existance of rows in the "wide" table?

thanks

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

Re: Questions about horizontal partitioning

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/08/07 20:39, Tom Lane wrote:
> John Sales <spelunker334@yahoo.com> writes:
>> By doing this, I'm hoping that the query optimizer is smart
>> enough to see that if a query comes in and requests only the
>> six columns (that are in the narrower table) that PostgreSQL
>> won't have to load the wider table into the buffer pool, and
>> thereby actually have to only access about 10% the amount of
>> disk that it presently does.
>
>> Is this a sound theory?
>
> No.  It still has to touch the second table to confirm the
> existence of rows to join to.

But if a query /requests *only* the six columns (that are in the
narrower table)/, why will the optimizer care about the other 224
columns?



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFo5ZfS9HxQb37XmcRAtDRAJ41kKEN1Dv1iKXosTjy6IvMZKGccACfcZc9
e4pV+u0uLFisHcLu/gyuCvE=
=q44l
-----END PGP SIGNATURE-----

Re: Questions about horizontal partitioning

From
Chander Ganesan
Date:
Ron Johnson wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/08/07 20:39, Tom Lane wrote: 
John Sales <spelunker334@yahoo.com> writes:   
By doing this, I'm hoping that the query optimizer is smart
enough to see that if a query comes in and requests only the
six columns (that are in the narrower table) that PostgreSQL
won't have to load the wider table into the buffer pool, and
thereby actually have to only access about 10% the amount of
disk that it presently does.     
Is this a sound theory?     
No.  It still has to touch the second table to confirm the
existence of rows to join to.   
But if a query /requests *only* the six columns (that are in the
narrower table)/, why will the optimizer care about the other 224
columns? 
It would.  A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.
-- 
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
Phone: 877-258-8987/919-463-0999
http://www.otg-nc.com
Expert PostgreSQL Training: http://www.otg-nc.com/training-courses/category.php?cat_id=8



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFo5ZfS9HxQb37XmcRAtDRAJ41kKEN1Dv1iKXosTjy6IvMZKGccACfcZc9
e4pV+u0uLFisHcLu/gyuCvE=
=q44l
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
              http://archives.postgresql.org/ 


Re: Questions about horizontal partitioning

From
Ron Johnson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/09/07 07:28, Chander Ganesan wrote:
> Ron Johnson wrote:
>>
>> On 01/08/07 20:39, Tom Lane wrote:
>>
>>> John Sales <spelunker334@yahoo.com> writes:
>>>
>>>> By doing this, I'm hoping that the query optimizer is smart
>>>> enough to see that if a query comes in and requests only the
>>>> six columns (that are in the narrower table) that PostgreSQL
>>>> won't have to load the wider table into the buffer pool, and
>>>> thereby actually have to only access about 10% the amount of
>>>> disk that it presently does.
>>>>       Is this a sound theory?
>>>>
>>> No.  It still has to touch the second table to confirm the
>>> existence of rows to join to.
>>>
>>
>> But if a query /requests *only* the six columns (that are in the
>> narrower table)/, why will the optimizer care about the other 224
>> columns?
>>
> It would.  A query that uses an inner join implies that a matching entry
> must exist in both tables - so the join must occur, otherwise you could
> be returning rows that don't satisfy the join condition.

Sure, if you were selecting those 6 columns from the "inner join
view".  <pause>  Ah, now that I reread the OP, I see that that's
what he seems to mean.



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFo5tdS9HxQb37XmcRApwEAKDiqD86q3sh5eePFrgH3+o4LbTAYwCg1Oys
3/WT7eJvbxfE4RDY3E99NAo=
=ix6x
-----END PGP SIGNATURE-----

Re: Questions about horizontal partitioning

From
"Peter Childs"
Date:
On 09/01/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/08/07 20:39, Tom Lane wrote:
> > John Sales <spelunker334@yahoo.com> writes:
> >> By doing this, I'm hoping that the query optimizer is smart
> >> enough to see that if a query comes in and requests only the
> >> six columns (that are in the narrower table) that PostgreSQL
> >> won't have to load the wider table into the buffer pool, and
> >> thereby actually have to only access about 10% the amount of
> >> disk that it presently does.
> >
> >> Is this a sound theory?
> >
> > No.  It still has to touch the second table to confirm the
> > existence of rows to join to.
>
> But if a query /requests *only* the six columns (that are in the
> narrower table)/, why will the optimizer care about the other 224
> columns?
>

If you are doing an inner join (read normal join) the column has to
exist in both tables to be in the final result. If your doing an outer
join it depends upon its type (left, right or full) and then postgres
may not optimise it out.


Peter Childs

Re: Questions about horizontal partitioning

From
Chander Ganesan
Date:
Ron Johnson wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/09/07 07:28, Chander Ganesan wrote: 
Ron Johnson wrote:   
On 01/08/07 20:39, Tom Lane wrote:     
John Sales <spelunker334@yahoo.com> writes:         
By doing this, I'm hoping that the query optimizer is smart
enough to see that if a query comes in and requests only the
six columns (that are in the narrower table) that PostgreSQL
won't have to load the wider table into the buffer pool, and
thereby actually have to only access about 10% the amount of
disk that it presently does.     Is this a sound theory?              
No.  It still has to touch the second table to confirm the
existence of rows to join to.          
But if a query /requests *only* the six columns (that are in the
narrower table)/, why will the optimizer care about the other 224
columns?      
It would.  A query that uses an inner join implies that a matching entry
must exist in both tables - so the join must occur, otherwise you could
be returning rows that don't satisfy the join condition.   
Sure, if you were selecting those 6 columns from the "inner join
view".  <pause>  Ah, now that I reread the OP, I see that that's
what he seems to mean. 
In theory, if the table with 6 columns was the child of the table with 200+ columns, and a PK-FK relationship existed, then the optimizer wouldn't need to check for the existence of the rows, since the PK-FK relationship would indicate that the rows did/do exist.  However, I don't *think* that the optimizer takes that into account (though with PostgreSQL you never know, it's pretty smart stuff ;-) ) .  Of course, that might make insert operations difficult, but if you are using a view to perform inserts you could probably handle that fairly easily in the on insert rule...
-- 
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
Phone: 877-258-8987/919-463-0999
http://www.otg-nc.com
Expert PostgreSQL training: http://www.otg-nc.com/training-courses/category.php?cat_id=8




-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFo5tdS9HxQb37XmcRApwEAKDiqD86q3sh5eePFrgH3+o4LbTAYwCg1Oys
3/WT7eJvbxfE4RDY3E99NAo=
=ix6x
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate      subscribe-nomail command to majordomo@postgresql.org so that your      message can get through to the mailing list cleanly 

Re: Questions about horizontal partitioning

From
Tom Lane
Date:
Chander Ganesan <chander@otg-nc.com> writes:
> In theory, if the table with 6 columns was the child of the table with
> 200+ columns, and a PK-FK relationship existed, then the optimizer
> wouldn't need to check for the existence of the rows, since the PK-FK
> relationship would indicate that the rows did/do exist.

No, that's still not right.  With a LEFT JOIN you know that each row of
the narrow table will produce at least one row in the join view.  What
you don't know is whether the row could produce more than one join row
--- ie, is there more than one wide-table row that joins to it?

To optimize away the join, the planner would have to find a unique
constraint on the wide table's join column(s).  This is certainly doable
in principle, though I find it questionable whether the planner should
spend cycles on every join query checking for something that won't be
true in the vast majority of real-world queries.  The main reason we
have not considered it to date is that the correctness of the plan would
then depend on a constraint that could get dropped --- but the plan
would not fail when run, as indeed it wouldn't be touching that table
at all.  We really need some plan-invalidation infrastructure to force
re-planning whenever a table's schema changes, and only then will it be
safe for the planner to start depending on constraints for correctness-
related decisions.  (This is why constraint exclusion is off by default
at the moment.)

I've been wanting plan invalidation for a long time ... maybe it will
finally happen in 8.3.

            regards, tom lane

Re: Questions about horizontal partitioning

From
Bruno Wolff III
Date:
On Tue, Jan 09, 2007 at 08:28:29 -0500,
  Chander Ganesan <chander@otg-nc.com> wrote:
> It would.  A query that uses an inner join implies that a matching entry
> must exist in both tables - so the join must occur, otherwise you could
> be returning rows that don't satisfy the join condition.

While this might not be worth doing, a foreign key constraint could be used
to establish the existance of the matching row, so that you wouldn't actually
need to look in the other table to verify that if you didn't need any of the
other columns in the other table. There will be some tricky cases for this,
such as when there are deferred constraints or when updating the foreign key
field.

Re: Questions about horizontal partitioning

From
Bruno Wolff III
Date:
On Tue, Jan 09, 2007 at 10:33:52 -0500,
  Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> No, that's still not right.  With a LEFT JOIN you know that each row of
> the narrow table will produce at least one row in the join view.  What
> you don't know is whether the row could produce more than one join row
> --- ie, is there more than one wide-table row that joins to it?

Thanks for pointing that out. I only thought of half of the problem.

> To optimize away the join, the planner would have to find a unique
> constraint on the wide table's join column(s).  This is certainly doable
> in principle, though I find it questionable whether the planner should
> spend cycles on every join query checking for something that won't be
> true in the vast majority of real-world queries.  The main reason we

In this case the test would only be applied when no columns were being used
in a table being joined to. Since that is also an unusual case, if that case
could be quickly checked for, then it might conceivably be worth doing the
more expensive test for the proper not null foreign key relation and unique
constraint.

Re: Questions about horizontal partitioning

From
Tom Lane
Date:
Bruno Wolff III <bruno@wolff.to> writes:
> In this case the test would only be applied when no columns were being used
> in a table being joined to. Since that is also an unusual case, if that case
> could be quickly checked for, then it might conceivably be worth doing the
> more expensive test for the proper not null foreign key relation and unique
> constraint.

Yeah.  Right offhand I think that would be fairly expensive too :-(.
The problem is that the query *DOES* have a reference to at least one
column of the wide table, namely its join key.  So you'd have to
distinguish whether any references appear "above" the join.  In the
current planner structure I think that this information is computed,
but not until it's far too late to be removing joins from the tree.

            regards, tom lane

Re: Questions about horizontal partitioning

From
"David Lee Lambert"
Date:

 

Ron Johnson wrote:

 
On 01/08/07 20:39, Tom Lane wrote:
  
John Sales <spelunker334@yahoo.com> writes:
    
By doing this, I'm hoping that the query optimizer is smart
enough to see that if a query comes in and requests only the
six columns (that are in the narrower table) that PostgreSQL
won't have to load the wider table into the buffer pool, and
thereby actually have to only access about 10% the amount of
disk that it presently does.
      
No.  It still has to touch the second table to confirm the
existence of rows to join to.
    
 
But if a query /requests *only* the six columns (that are in the
narrower table)/, why will the optimizer care about the other 224
columns?
  

It would.  A query that uses an inner join implies that a matching entry must exist in both tables - so the join must occur, otherwise you could be returning rows that don't satisfy the join condition.

However,  if the primary key is entirely within those six columns,  there will have to be an index on it in both tables to enforce the primary key constraint.  In that case,  an inner join could be performed with an index lookup or an index scan plus hash join,  for a query that didn’t use any other columns.  Whether that translates into a significant I/O reduction depends on how wide and how frequently non-NULL those other columns are.

 

Re: Questions about horizontal partitioning

From
"Anton Melser"
Date:
> However,  if the primary key is entirely within those six columns,  there
> will have to be an index on it in both tables to enforce the primary key
> constraint.  In that case,  an inner join could be performed with an index
> lookup or an index scan plus hash join,  for a query that didn't use any
> other columns.  Whether that translates into a significant I/O reduction
> depends on how wide and how frequently non-NULL those other columns are.
>
>

... if someone is feeling pedagogical (and the answer isn't that
complicated), could they explain why a simple index on the desired
columns wouldn't be the best solution?
Cheers
Antoine