Thread: Partitioning such that key field of inherited tables no longer retains any selectivity

The subject line may not actually describe what I want to illustrate… 

Basically, let’s say we have a nicely partitioned data-set. Performance is a net win and I’m happy with it.
The partitioning scheme is equality based, rather than range based.

That is, each partition contains a subset of the data where partition_key = {some_value}, and of course we let constraint exclusion enable the optimiser to do its thing.

As such, all of the data contained in a given partition has the same value for partition_key. That field, within the scope of its partition – isn’t terribly useful anymore, and in my mind is wasting bytes – it’s only purpose really is to allow the CHECK constraint to verify the data is what it should be.


Wouldn’t it be nice if we could somehow create a child table where we could define a const field value, that did not need to be stored on disk at the tuple level?
This would allow the check constraint to supply the optimiser with the information it needs, while removing the need to consume disk to record a field whose value is always the same.


Extending this idea..
Postgresql could possibly look at any equality based check constraint for a table and instead of storing each field value verbatim, we could implicitly optimise away the need to write those field values to disk, on the understanding that those values can never change (unless the constraint is removed/altered).

I’m sure there are all kinds of worms in this canister, but I thought it might be an interesting discussion.


Cheers,

Tim
Tim Kane wrote
> The subject line may not actually describe what I want to illustrate…
>
> Basically, let’s say we have a nicely partitioned data-set. Performance is
> a
> net win and I’m happy with it.
> The partitioning scheme is equality based, rather than range based.
>
> That is, each partition contains a subset of the data where partition_key
> =
> {some_value}, and of course we let constraint exclusion enable the
> optimiser
> to do its thing.
>
> As such, all of the data contained in a given partition has the same value
> for partition_key. That field, within the scope of its partition – isn’t
> terribly useful anymore, and in my mind is wasting bytes – it’s only
> purpose
> really is to allow the CHECK constraint to verify the data is what it
> should
> be.
>
>
> Wouldn’t it be nice if we could somehow create a child table where we
> could
> define a const field value, that did not need to be stored on disk at the
> tuple level?
> This would allow the check constraint to supply the optimiser with the
> information it needs, while removing the need to consume disk to record a
> field whose value is always the same.
>
>
> Extending this idea..
> Postgresql could possibly look at any equality based check constraint for
> a
> table and instead of storing each field value verbatim, we could
> implicitly
> optimise away the need to write those field values to disk, on the
> understanding that those values can never change (unless the constraint is
> removed/altered).
>
> I’m sure there are all kinds of worms in this canister, but I thought it
> might be an interesting discussion.
>
>
> Cheers,
>
> Tim

Two approaches:
1. Standard virtual column name that, when used, gets rewritten into a
constant that is stored at the table level.
2. A way for a column's value to be defined as a function call.

Option 2 has the virtue of being more generally applicable but you'd need
some way to know that for any give table that a given function resolves to a
constant.  Maybe have a magic function like partitonid(tabloid) that if used
in a query would be interpreted in this way.  Combined with option 1 and the
stand column could be pre-defined in this way - if the partition constant
exists which is the main thing to avoid - increased checking/rewriting time
for non-partitioned tables.

David J.




--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Partitioning-such-that-key-field-of-inherited-tables-no-longer-retains-any-selectivity-tp5803549p5803561.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


David G Johnston <david.g.johnston@gmail.com> writes:
> Two approaches:
> 1. Standard virtual column name that, when used, gets rewritten into a
> constant that is stored at the table level.
> 2. A way for a column's value to be defined as a function call.

Recent versions of the SQL spec have a notion of "generated columns"
that I think subsumes both of these concepts.  We had a draft patch
awhile back that attempted to implement that feature.  It crashed
and burned for reasons I don't recall ... but certainly implementing
an already-standardized feature is more attractive than just inventing
behavior on our own.

            regards, tom lane




From: Tom Lane <tgl@sss.pgh.pa.us>
David G Johnston <david.g.johnston@gmail.com> writes:
Two approaches:
1. Standard virtual column name that, when used, gets rewritten into a
constant that is stored at the table level.
2. A way for a column's value to be defined as a function call.

Recent versions of the SQL spec have a notion of "generated columns"
that I think subsumes both of these concepts.  We had a draft patch
awhile back that attempted to implement that feature.  It crashed
and burned for reasons I don't recall ... but certainly implementing
an already-standardized feature is more attractive than just inventing
behavior on our own.


That sounds interesting.
Is this what you are referring to?  Actually, it looks like it would fit the bill and then some.


—————————————————

4.14.8 Base columns and generated columns

A column of a base table is either a base column or a generated column. A base column is one that is not a generated column. A generated column is one whose values are determined by evaluation of a generation expression, a <value expression> whose declared type is by implication that of the column. A generation expression can reference base columns of the base table to which it belongs but cannot otherwise access SQL- data. Thus, the value of the field corresponding to a generated column in row R is determined by the values of zero or more other fields of R.

A generated column GC depends on each column that is referenced by a <column reference> in its generation expression, and each such referenced column is a parametric column of GC.

 —————————————————




On Sun, May 11, 2014 at 2:08 PM, Tim Kane [via PostgreSQL] <[hidden email]> wrote:


From: Tom Lane <[hidden email]>
David G Johnston <[hidden email]> writes:
Two approaches:
1. Standard virtual column name that, when used, gets rewritten into a
constant that is stored at the table level.
2. A way for a column's value to be defined as a function call.

Recent versions of the SQL spec have a notion of "generated columns"
that I think subsumes both of these concepts.  We had a draft patch
awhile back that attempted to implement that feature.  It crashed
and burned for reasons I don't recall ... but certainly implementing
an already-standardized feature is more attractive than just inventing
behavior on our own.


That sounds interesting.
Is this what you are referring to?  Actually, it looks like it would fit the bill and then some.


—————————————————

4.14.8 Base columns and generated columns

A column of a base table is either a base column or a generated column. A base column is one that is not a generated column. A generated column is one whose values are determined by evaluation of a generation expression, a <value expression> whose declared type is by implication that of the column. A generation expression can reference base columns of the base table to which it belongs but cannot otherwise access SQL- data. Thus, the value of the field corresponding to a generated column in row R is determined by the values of zero or more other fields of R.

A generated column GC depends on each column that is referenced by a <column reference> in its generation expression, and each such referenced column is a parametric column of GC.

 —————————————————



​This is basically what I intended to describe in "option 2"...without the benefit of ever having really read the SQL standard.

So the planner would have to know that, for a given table, the generation expression results in a constant - would likely in fact have to be a constant expression like, assuming a non-number value, ='column_value', where the "=" sign indicates that this is a generation expression and not a stored value (like default behaves currently).

Given that value of the partition column is constant, and each child table has a name, is there some way, with the current implementation, to write a query like:

SELECT * 
FROM table_hierarchy
WHERE tableoid = ANY( get_tableoids_as_array('TABLE_NAME1'[,VARARGS]))

and have exclusion constraints work correctly?

Also, toward that end, it would seem that in this particular situation you could accomplish much the same by using dynamic SQL; though I guess that would depend on whether any given query needs to be able to return values from more than one table.

My fluency with respect to inheritance is poor so please forgive if I'm out in the rough on this one.

David J.



View this message in context: Re: Partitioning such that key field of inherited tables no longer retains any selectivity
Sent from the PostgreSQL - general mailing list archive at Nabble.com.
W dniu 11.05.2014 22:01, David G Johnston pisze:
On Sun, May 11, 2014 at 2:08 PM, Tim Kane [via PostgreSQL] <[hidden email]> wrote:
[------------------]

​This is basically what I intended to describe in "option 2"...without the benefit of ever having really read the SQL standard.

So the planner would have to know that, for a given table, the generation expression results in a constant - would likely in fact have to be a constant expression like, assuming a non-number value, ='column_value', where the "=" sign indicates that this is a generation expression and not a stored value (like default behaves currently).

wouldn't it be ways better, if the constraints for partitioning by inharitance were set at the "master" table, instead of the way it's currently done at the inharited tables (as exclusive CHECK-s there)?

I mean a constraint like a "function(table columns) reutrning table_name or tablespace_name of the actual target table"?

<start preudocode>
create table master (a int, b int, c int);
create table table_a (inharits master);
create table table_b (inharits master);

create function(a,b) returns text as $$ if a > b then return "table_a" else return "table_b"; end if; end $$
... or:
create function(a,b) returns tablespace as $$ if a > b then return tablespace("table_a") else return tablespace("table_b"); end if; end $$

alter table master add constraint "partitioning" check/select/route function(a,b);
<end pseudocode>


-R