Thread: partitioning question 1
hello -- my last email was apparently too long to respond to so i'll split it up into shorter pieces. my first question : my understanding of how range partitioning and constraint exclusion works leads me to believe that it does not buy any queryperformance that a clustered index doesn't already give you -- the advantages are all in maintainability. an indexis able to eliminate pages just as well as constraint exclusion is able to eliminate table partitions. the I/O advantagesof having queries target small subtables are the same as the I/O advantages of clustering the index : result pagesin a small range are very close to each other on disk. finally, since constraint exclusion isn't as flexible as indexing (i've seen old mailing list posts that say that constraintexclusion only works with static constants in where clauses, and only works with simple operators like >, < whichbasically forces btree indexes when i want to use gist) it is indeed likely that partitioning can be slower than onebig table with a clustered index. is my intuition completely off on this? best regards, ben
On Thu, 2010-10-28 at 09:36 -0700, Ben wrote: > hello -- > > my last email was apparently too long to respond to so i'll split it up into shorter pieces. my first question : > > my understanding of how range partitioning and constraint exclusion works leads me to believe that it does not buy anyquery performance that a clustered index doesn't already give you -- the advantages are all in maintainability. an indexis able to eliminate pages just as well as constraint exclusion is able to eliminate table partitions. the I/O advantagesof having queries target small subtables are the same as the I/O advantages of clustering the index : result pagesin a small range are very close to each other on disk. Not entirely true. One a clustered index will not stay clustered if you are still updating data that is in the partition. You shouldn't underestimate the benefit of smaller relations in terms of maintenance either. > > finally, since constraint exclusion isn't as flexible as indexing (i've seen old mailing list posts that say that constraintexclusion only works with static constants in where clauses, and only works with simple operators like >, < whichbasically forces btree indexes when i want to use gist) it is indeed likely that partitioning can be slower than onebig table with a clustered index. Yes the constraints have to be static. Not sure about the operator question honestly. > is my intuition completely off on this? You may actually want to look into expression indexes, not clustered ones. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
thanks for the prompt response. some comments / questions below : On Oct 28, 2010, at 10:31 AM, Joshua D. Drake wrote: >> ...constraint exclusion is able to eliminate table partitions. the I/O advantages of having queries target small subtablesare the same as the I/O advantages of clustering the index : result pages in a small range are very close to eachother on disk. > > Not entirely true. One a clustered index will not stay clustered if you > are still updating data that is in the partition. You shouldn't > underestimate the benefit of smaller relations in terms of maintenance > either. in my situation, the update come in-order (it is timeseries data and the clustered index is on time.) so the table shouldremain relatively clustered. updates also happen relatively infrequently (once a day in one batch.) so it appearsthat we will continue to get the I/O benefits described above. are there any other benefits which partitioning provides for query performance (as opposed to update performance) besidesthe ones which i have mentioned? > Yes the constraints have to be static. Not sure about the operator > question honestly. this seems to severely restrict their usefulness -- our queries are data warehouse analytical -type queries, so the constraintsare usually data-driven (come from joining against other tables.) >> is my intuition completely off on this? > > You may actually want to look into expression indexes, not clustered > ones. what would expression indexes give me? thanks and best regards, ben
On Thu, 2010-10-28 at 11:44 -0700, Ben wrote: > > Yes the constraints have to be static. Not sure about the operator > > question honestly. > > this seems to severely restrict their usefulness -- our queries are data warehouse analytical -type queries, so the constraintsare usually data-driven (come from joining against other tables.) Well it does and it doesn't. Keep in mind that the constraint can be: date >= '2010-10-01" and date <= '2010-10-31' What it can't be is something that contains date_part() or extract() (as an example) > > >> is my intuition completely off on this? > > > > You may actually want to look into expression indexes, not clustered > > ones. Take a look at the docs: http://www.postgresql.org/docs/8.4/interactive/indexes-expressional.html It "could" be considered partitioning without breaking up the table, just the indexes. Sincerely, Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Oct 28, 2010, at 11:50 AM, Joshua D. Drake wrote: >>> Yes the constraints have to be static. Not sure about the operator >>> question honestly. >> >> this seems to severely restrict their usefulness -- our queries are data warehouse analytical -type queries, so the constraintsare usually data-driven (come from joining against other tables.) > > Well it does and it doesn't. Keep in mind that the constraint can be: > > date >= '2010-10-01" and date <= '2010-10-31' > > What it can't be is something that contains date_part() or extract() (as > an example) i think we are talking about two different things here: the constraints on the table, and the where-clause constraints ina query which may or may not trigger constraint exclusion. i understand that table constraints have to be constants --it doesn't make much sense otherwise. what i am wondering about is, will constraint exclusion be triggered for querieswhere the column that is being partitioned on is being constrained things that are not static constants, for instance,in a join. (i'm pretty sure the answer is no, because i think constraint exclusion happens before real query planning.) a concrete example : create table foo (i integer not null, j float not null); create table foo_1 (check ( i >= 0 and i < 10) ) inherits (foo); create table foo_2 (check ( i >= 10 and i < 20) ) inherits (foo); create table foo_3 (check ( i >= 20 and i < 30) ) inherits (foo); etc.. create table bar (i integer not null, k float not null); my understanding is that a query like select * from foo, bar using (i); can't use constraint exclusion, even if the histogram of i-values on table bar says they only live in the range 0-9, andso the query will touch all of the tables. i think this is not favorable compared to a single foo table with a well-maintainedbtree index on i. >>>> is my intuition completely off on this? >>> >>> You may actually want to look into expression indexes, not clustered >>> ones. > > Take a look at the docs: > > http://www.postgresql.org/docs/8.4/interactive/indexes-expressional.html > > It "could" be considered partitioning without breaking up the table, > just the indexes. do you mean partial indexes? i have to confess to not understanding how this is relevant -- how could partial indexes giveany advantage over a full clustered index? b
On Thu, 2010-10-28 at 12:25 -0700, Ben wrote: > i think we are talking about two different things here: the constraints on the table, and the where-clause constraintsin a query which may or may not trigger constraint exclusion. i understand that table constraints have to beconstants -- it doesn't make much sense otherwise. what i am wondering about is, will constraint exclusion be triggeredfor queries where the column that is being partitioned on is being constrained things that are not static constants,for instance, in a join. (i'm pretty sure the answer is no, because i think constraint exclusion happens beforereal query planning.) a concrete example : > > create table foo (i integer not null, j float not null); > create table foo_1 (check ( i >= 0 and i < 10) ) inherits (foo); > create table foo_2 (check ( i >= 10 and i < 20) ) inherits (foo); > create table foo_3 (check ( i >= 20 and i < 30) ) inherits (foo); > etc.. > > create table bar (i integer not null, k float not null); > > my understanding is that a query like > > select * from foo, bar using (i); > > can't use constraint exclusion, even if the histogram of i-values on table bar says they only live in the range 0-9, andso the query will touch all of the tables. i think this is not favorable compared to a single foo table with a well-maintainedbtree index on i. > My tests show you are incorrect: part_test=# explain analyze select * from foo join bar using (i) where i=9; QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=34.26..106.76 rows=200 width=20) (actual time=0.004..0.004 rows=0 loops=1) -> Append (cost=0.00..68.50 rows=20 width=12) (actual time=0.004..0.004 rows=0 loops=1) -> Seq Scan on foo (cost=0.00..34.25 rows=10 width=12) (actual time=0.001..0.001 rows=0 loops=1) Filter: (i = 9) -> Seq Scan on foo_1 foo (cost=0.00..34.25 rows=10 width=12) (actual time=0.000..0.000 rows=0 loops=1) Filter: (i = 9) -> Materialize (cost=34.26..34.36 rows=10 width=12) (never executed) -> Seq Scan on bar (cost=0.00..34.25 rows=10 width=12) (never executed) Filter: (i = 9) Total runtime: 0.032 ms (10 rows) -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
On Oct 28, 2010, at 12:44 PM, Joshua D. Drake wrote: > > My tests show you are incorrect: > > > part_test=# explain analyze select * from foo join bar using (i) where > i=9; > QUERY > PLAN > ------------------------------------------------------------------------------------------------------------------ > Nested Loop (cost=34.26..106.76 rows=200 width=20) (actual > time=0.004..0.004 rows=0 loops=1) > -> Append (cost=0.00..68.50 rows=20 width=12) (actual > time=0.004..0.004 rows=0 loops=1) > -> Seq Scan on foo (cost=0.00..34.25 rows=10 width=12) > (actual time=0.001..0.001 rows=0 loops=1) > Filter: (i = 9) > -> Seq Scan on foo_1 foo (cost=0.00..34.25 rows=10 width=12) > (actual time=0.000..0.000 rows=0 loops=1) > Filter: (i = 9) > -> Materialize (cost=34.26..34.36 rows=10 width=12) (never > executed) > -> Seq Scan on bar (cost=0.00..34.25 rows=10 width=12) (never > executed) > Filter: (i = 9) > Total runtime: 0.032 ms > (10 rows) strange. my tests don't agree with your tests : create table foo (i integer not null, j float not null); create table foo_1 ( check (i >= 0 and i < 10) ) inherits (foo); create table foo_2 ( check (i >= 10 and i < 20) ) inherits (foo); create table foo_3 ( check (i >= 20 and i < 30) ) inherits (foo); create index foo_1_idx on foo_1 (i); create index foo_2_idx on foo_2 (i); create index foo_3_idx on foo_3 (i); insert into foo_1 select generate_series, generate_series from generate_series(0,9); insert into foo_2 select generate_series, generate_series from generate_series(10,19); insert into foo_3 select generate_series, generate_series from generate_series(20,29); create table bar (i integer not null, k float not null); create index bar_idx on bar (i); insert into bar select generate_series, -generate_series from generate_series(0,9); vacuum analyze; explain analyze select * from foo join bar using (i); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Hash Join (cost=1.23..42.29 rows=98 width=20) (actual time=0.056..0.118 rows=10 loops=1) Hash Cond: (public.foo.i = bar.i) -> Append (cost=0.00..32.70 rows=1970 width=12) (actual time=0.008..0.043 rows=30 loops=1) -> Seq Scan on foo (cost=0.00..29.40 rows=1940 width=12) (actual time=0.001..0.001 rows=0 loops=1) -> Seq Scan on foo_1 foo (cost=0.00..1.10 rows=10 width=12) (actual time=0.005..0.008 rows=10 loops=1) -> Seq Scan on foo_2 foo (cost=0.00..1.10 rows=10 width=12) (actual time=0.003..0.006 rows=10 loops=1) -> Seq Scan on foo_3 foo (cost=0.00..1.10 rows=10 width=12) (actual time=0.003..0.006 rows=10 loops=1) -> Hash (cost=1.10..1.10 rows=10 width=12) (actual time=0.025..0.025 rows=10 loops=1) -> Seq Scan on bar (cost=0.00..1.10 rows=10 width=12) (actual time=0.005..0.013 rows=10 loops=1) Total runtime: 0.205 ms (10 rows) i'm running pg 8.4.3 with constraint_exclusion=on (just to be safe.) best, b
On Thu, 2010-10-28 at 12:59 -0700, Ben wrote:
> explain analyze select * from foo join bar using (i);
vs
explain analyze select * from foo join bar using (i) where i=9;
On Thu, 2010-10-28 at 12:59 -0700, Ben wrote: > On Oct 28, 2010, at 12:44 PM, Joshua D. Drake wrote: > > > > My tests show you are incorrect: > > > > > > part_test=# explain analyze select * from foo join bar using (i) where > > i=9; > > QUERY > > PLAN > > ------------------------------------------------------------------------------------------------------------------ > > Nested Loop (cost=34.26..106.76 rows=200 width=20) (actual > > time=0.004..0.004 rows=0 loops=1) > > -> Append (cost=0.00..68.50 rows=20 width=12) (actual > > time=0.004..0.004 rows=0 loops=1) > > -> Seq Scan on foo (cost=0.00..34.25 rows=10 width=12) > > (actual time=0.001..0.001 rows=0 loops=1) > > Filter: (i = 9) > > -> Seq Scan on foo_1 foo (cost=0.00..34.25 rows=10 width=12) > > (actual time=0.000..0.000 rows=0 loops=1) > > Filter: (i = 9) > > -> Materialize (cost=34.26..34.36 rows=10 width=12) (never > > executed) > > -> Seq Scan on bar (cost=0.00..34.25 rows=10 width=12) (never > > executed) > > Filter: (i = 9) > > Total runtime: 0.032 ms > > (10 rows) > > strange. my tests don't agree with your tests : Do you have constraint_exclusion turned on? You should verify with show constraint_exclusion (I saw what you wrote below). JD P.S. Blatant plug, you coming to http://www.postgresqlconference.org ? -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
whoops, didn't see the i=9 (linebreak! linebreak!) nonetheless that is a static constant constraint on the column i, and i was asking if constraint exclusions would work fordynamic constraints (like those derived from a table joined against.) so for example the bar table has only 0-9 in itshistogram for i, but constraint exclusion can't use that to eliminate tables foo_2 and foo_3. this is precisely the kindof information an index can use via join selectivity. i am not going to the pg conference, sorry to say. b On Oct 28, 2010, at 1:48 PM, Joshua D. Drake wrote: > On Thu, 2010-10-28 at 12:59 -0700, Ben wrote: >> On Oct 28, 2010, at 12:44 PM, Joshua D. Drake wrote: >>> >>> My tests show you are incorrect: >>> >>> >>> part_test=# explain analyze select * from foo join bar using (i) where >>> i=9; >>> QUERY >>> PLAN >>> ------------------------------------------------------------------------------------------------------------------ >>> Nested Loop (cost=34.26..106.76 rows=200 width=20) (actual >>> time=0.004..0.004 rows=0 loops=1) >>> -> Append (cost=0.00..68.50 rows=20 width=12) (actual >>> time=0.004..0.004 rows=0 loops=1) >>> -> Seq Scan on foo (cost=0.00..34.25 rows=10 width=12) >>> (actual time=0.001..0.001 rows=0 loops=1) >>> Filter: (i = 9) >>> -> Seq Scan on foo_1 foo (cost=0.00..34.25 rows=10 width=12) >>> (actual time=0.000..0.000 rows=0 loops=1) >>> Filter: (i = 9) >>> -> Materialize (cost=34.26..34.36 rows=10 width=12) (never >>> executed) >>> -> Seq Scan on bar (cost=0.00..34.25 rows=10 width=12) (never >>> executed) >>> Filter: (i = 9) >>> Total runtime: 0.032 ms >>> (10 rows) >> >> strange. my tests don't agree with your tests : > > Do you have constraint_exclusion turned on? You should verify with show > constraint_exclusion (I saw what you wrote below). > > JD > > P.S. Blatant plug, you coming to http://www.postgresqlconference.org ? > > > -- > PostgreSQL.org Major Contributor > Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 > Consulting, Training, Support, Custom Development, Engineering > http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt >
> -----Original Message----- > From: Ben [mailto:midfield@gmail.com] > Sent: Thursday, October 28, 2010 12:37 PM > To: pgsql-performance@postgresql.org > Subject: partitioning question 1 > > hello -- > > my last email was apparently too long to respond to so i'll > split it up into shorter pieces. my first question : > > my understanding of how range partitioning and constraint > exclusion works leads me to believe that it does not buy any > query performance that a clustered index doesn't already give > you -- the advantages are all in maintainability. an index > is able to eliminate pages just as well as constraint > exclusion is able to eliminate table partitions. the I/O > advantages of having queries target small subtables are the > same as the I/O advantages of clustering the index : result > pages in a small range are very close to each other on disk. > > finally, since constraint exclusion isn't as flexible as > indexing (i've seen old mailing list posts that say that > constraint exclusion only works with static constants in > where clauses, and only works with simple operators like >, < > which basically forces btree indexes when i want to use gist) > it is indeed likely that partitioning can be slower than one > big table with a clustered index. > > is my intuition completely off on this? > > best regards, ben > If your SELECT retrieves substantial amount of records, table scan could be more efficient than index access. Now, if while retrieving large amount of records "WHERE clause" of this SELECT still satisfies constraints on some partition(s), then obviously one (or few) partition scans will be more efficient than full table scan of non-partitioned table. So, yes partitioning provides performance improvements, not only maintenance convenience. Regards, Igor Neyman
On Oct 29, 2010, at 7:38 AM, Igor Neyman wrote: >> is my intuition completely off on this? >> >> best regards, ben >> > > If your SELECT retrieves substantial amount of records, table scan could > be more efficient than index access. > > Now, if while retrieving large amount of records "WHERE clause" of this > SELECT still satisfies constraints on some partition(s), then obviously > one (or few) partition scans will be more efficient than full table scan > of non-partitioned table. > > So, yes partitioning provides performance improvements, not only > maintenance convenience. my impression was that a *clustered* index would give a lot of the same I/O benefits, in a more flexible way. if you'reclustered on the column in question, then an index scan for a range is much like a sequential scan over a partition(as far as i understand.) b
> -----Original Message----- > From: Ben [mailto:midfield@gmail.com] > Sent: Friday, October 29, 2010 12:16 PM > To: Igor Neyman > Cc: pgsql-performance@postgresql.org > Subject: Re: partitioning question 1 > > On Oct 29, 2010, at 7:38 AM, Igor Neyman wrote: > > >> is my intuition completely off on this? > >> > >> best regards, ben > >> > > > > If your SELECT retrieves substantial amount of records, table scan > > could be more efficient than index access. > > > > Now, if while retrieving large amount of records "WHERE clause" of > > this SELECT still satisfies constraints on some partition(s), then > > obviously one (or few) partition scans will be more efficient than > > full table scan of non-partitioned table. > > > > So, yes partitioning provides performance improvements, not only > > maintenance convenience. > > my impression was that a *clustered* index would give a lot > of the same I/O benefits, in a more flexible way. if you're > clustered on the column in question, then an index scan for a > range is much like a sequential scan over a partition (as far > as i understand.) > > b > Even with clustered index you still read index+table, which is more expensive than just table scan (in situation I described above). PG clustered index is not the same as SQL Server clustered index (which includes actual table pages on the leaf level). Igor Neyman