A Guide to Constraint Exclusion (Partitioning) - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | A Guide to Constraint Exclusion (Partitioning) |
Date | |
Msg-id | 1121251997.3970.237.camel@localhost.localdomain Whole thread Raw |
Responses |
Re: A Guide to Constraint Exclusion (Partitioning)
Re: [Bizgres-general] A Guide to Constraint Exclusion Re: A Guide to Constraint Exclusion (Partitioning) |
List | pgsql-hackers |
A Guide to the Constraint Exclusion Feature =========================================== Simon Riggs 2ndQuadrant simon@2ndquadrant.com INTRODUCTION Constraint Exclusion (CE) is an optimizer patch submitted for PostgreSQL 8.1. CE aims to greatly improve the performance for certain types of common queries against large tables. CE extends the logic originally developed for Partial Indexes to allow the optimizer to avoid scanning particular tables for certain queries. CE will exclude a table from the query plan when the query's WHERE clause is proven not to "overlap" with any of the Constraints defined on those tables. No new syntax is required to take advantage of this feature, which is designed to work in conjunction with existing Inheritance features. Some syntax *is* required to declare Constraints, which are already ANSI/ISO SQL:2003 compliant. In other RDBMS these features are often referred to as Partitioning. The PostgreSQL CE feature is much more widely applicable than that name might suggest and will be of benefit to many users, not just for Business Intelligence applications. The aim of this guide is to help prospective users understand how the feature will work, to allow improvement during the 8.1 beta test cycle. All feedback is welcome. TABLE PARTITIONING OVERVIEW Let's look at a practical example of how this works and the performance benefits it provides: In many BI workloads there are a small number of very large tables, often referred to as Fact tables. In order to improve load and query performance against those tables it is often useful to split these larger tables into pieces, known as partitions. PostgreSQL 8.0 could support pseudo-partitioning using the Inheritance feature. Here's a BI Fact table for a Retail DW as an example of that: CREATE TABLE Sales_DateItemOutlet ( DateKey Integer , OutletKey Integer , ItemKey Integer , SoldQty Integer ); This table could be accessed using a query likeSELECT sum(soldqty) FROM Sales_DateItemOutletWHERE DateKey between 20050101and 20050101 This can then be split up so that we have a number of sub-tables: CREATE TABLE Sales_Jan_DateItemOutlet () INHERITS (Sales_DateItemOutlet); CREATE TABLE Sales_Feb_DateItemOutlet () INHERITS (Sales_DateItemOutlet); CREATE TABLE Sales_Mar_DateItemOutlet () INHERITS (Sales_DateItemOutlet); When we load this we put January's data in the Jan table etc, though put no rows at all in the "parent" table. Now, if we run our example query againSELECT sum(soldqty) FROM Sales_DateItemOutletWHERE DateKey between 20050101 and 20050101 we find that the query willScan all rows in Sales_DateItemOutlet (which is empty)Scan all rows in Sales_Jan_DateItemOutletScanall rows in Sales_Feb_DateItemOutletScan all rows in Sales_Mar_DateItemOutlet and return the correct answer. But we know that the query did not really need to have scanned the Feb and Mar tables, since these do not contain any data that would satisfy the query. (The full EXPLAIN output is not shown above, for clarity only.) The new CE functionality aims to improve the performance of such queries. Ideally we would like the query to Scan all rows in Sales_Jan_DateItemOutlet and ignore the parent table, and all other child tables. To do allow this, we must provide more declarative information to allow the optimizer to understand as much as we do. To do this, we add constraints on to each table, so that it is clear what rows they can contain. CREATE TABLE Sales_Jan_DateItemOutlet ( CHECK (DateKey BETWEEN 20050101 AND 20050131) ) INHERITS (Sales_DateItemOutlet); CREATE TABLE Sales_Feb_DateItemOutlet ( CHECK (DateKey BETWEEN 20050201 AND 20050229) ) INHERITS (Sales_DateItemOutlet); CREATE TABLE Sales_Mar_DateItemOutlet ( CHECK (DateKey BETWEEN 20050301 AND 20050331) ) INHERITS (Sales_DateItemOutlet); Now, when we execute our test query, the optimizer can understand that the Feb and Mar tables can *never* hold rows that would match our query. As a result it is able to provably exclude them from the query plan. Thus, with Constraint Exclusion enabled our test querySELECT sum(soldqty) FROM Sales_DateItemOutletWHERE DateKey between20050101 and 20050101 will perform the following scansScan all rows in Sales_DateItemOutlet (which is empty)Scan all rows in Sales_Jan_DateItemOutlet Running an EXPLAIN or EXPLAIN ANALYZE will allow you to see which tables have been included in the query. There is no explicit message to say that a table has been excluded. We aren't able to exclude the parent table from the above query because no Constraint was defined upon it. Since, in our example, the parent is empty there will be little effect on the query performance. It would be a mistake to attempt to get around this by placing a Constraint on the parent, since that would then automatically be created on all child tables also. So we can never exclude the parent without automatically excluding *all* of the children also. Overall then, if the partitions are all the same size, we will have reduced execution time of the query to around one-third of its previous elapsed time. If we had 10 or 100 child tables, certain queries could be 10 or 100 times faster than without the CE feature. In summary, the CE feature will be a huge performance gain for qualifying queries against large tables in PostgreSQL databases. Since no new syntax is required, existing PostgreSQL databases designed to take advantage of inheritance will automatically benefit from the performance enhancements. CONSTRAINT EXCLUSION If you wish to enable this feature you must set the server parameterenable_constraint_exclusion = true This is a user-settable parameter, so could be enabled for specific users or queries. Constraint exclusion is not enabled by default. This is because PostgreSQL does not yet have full plan invalidation when DDL changes are made. If a change were made to any of the tables, the query would not automatically re-optimize, so it would be possible to get an incorrect answer returned from a CE plan. This is not a fault of CE, but simply a side-effect of the current lack of full plan invalidation in PostgreSQL. CE will work only when the query contains a direct and simple restriction of the table, such as Attribute > Constant If Operators are available for that datatype, CE will use = < <= > >= <> These operators must be defined as IMMUTABLE. CE will work for most simple Constraints. Constraints are already limited to IMMUTABLE predicates, such as Attribute > Constant. Constraints likeAttribute BETWEEN Const1 and Const2 can be used to produce ranges of values, and would be generally described as RANGE PARTITIONING. Constraints likeAttribute IN (Const1, Const2, Const3 ...) can be used to produce lists of values, and would generally be described as LIST PARTITIONING. All of these Constraint types could be mixed, to allow Constraints likeAttribute1 IN (Const1, Const2) AND Attribute2 >= Const3 AND Attribute2 < Const4 Thus, the CE feature allows a very flexible partitioning scheme to be developed that could mix LIST, RANGE style partitioning clauses. Currently, there is no restriction that all constraints *must* be mutually exclusive, nor even that the constraints may be similar on each table. This can be useful for some designs where the inheritance hierarchy is not "disjoint", as UML would term this situation. CE does not prevent direct access to one of the child tables in an inheritance hierarchy. In this case, no exclusion test would be performed. Exclusion tests are performed *only* when the parent table in an inheritance hierarchy is accessed. Exclusion tests are performed even if the inheritance hierarchy is many levels deep (e.g. parent-child- grandchild). CE also supports multiple inheritance. PostgreSQL makes no restriction upon what indices are defined on the various child tables. CE is not dependent upon the existence or absence of indices, only upon the Query's WHERE clause and the Constraints defined upon the tables. CE can be very useful for large historical databases. As an example, PostgreSQL would allow a historical data table split into sections like this: - Last 3 months: one table per week, each table defined with 3 indices - 3-12 months: one table per month, no indices defined - 12-36 months: tablespaces defined on hierarchical storage (near-line), with no indices, so de-archived when required for use. The CE feature would allow the optimizer to avoid wasteful de-archiving of the older data, with an appropriate database/query design. The current patch also includes a suite of 700 tests that successfully exclude child tables in all common query types (whilst returning the correct answer!). These include examples of List, Range and mixed partitioning scenarios. CURRENT RESTRICTIONS It is not yet possible to specify that Constraints on child tables will be mutually exclusive of each other. Currently, it would be up to the designer to ensure that, if desired. It is not yet possible to specify that an inheritance parent has no rows, and, if so, should always be excluded from the query. If a parent table has a Constraint defined upon it, then this will be automatically copied to all child tables. Currently, there is no way to tell which Constraints have been inherited from the parent, so exclusion tests will be re-executed against all child tables. This will cause additional optimization time. Currently, all child tables will be considered. It may be possible in the future to pre-sort the list of child tables, so that optimization time can be reduced for parent tables with large numbers of partitions. Currently, there is no index on the pg_inherits system table. As a result, parents with more than 1000 child tables are likely to experience longer than desirable planning times for their queries. CE checks will not currently recognise STABLE functions within a query. So WHERE clauses such asDateKey > CURRENT DATE will not cause exclusion because CURRENT DATE is a STABLE function. CE checks are not made when the parent table is involved in a join. Other existing restrictions on Inherited tables continue to apply. Further enhancements to the CE feature can be expected in the future. COMPARISONS WITH OTHER RDBMS In brief: Teradata and Oracle already have table Partitioning. Oracle's partitioning allows either LIST or RANGE partitioning. The new PostgreSQL CE feature is considerably more flexible than Oracle's declarative Partitioning syntax, and is achieved without adding non-standard SQL extensions. Both Sybase and MS SQLServer 2005 have new Partitioning features released this year. The Microsoft feature does allow joins between tables to exclude partitions. However, it also requires that all partitions should have identical indexes, which is a major blocking factor to the use of very large tables. DB2 has partitioning using UNION ALL views, which includes somewhat similar functionality to the PostgreSQL CE feature. DB2 uses the PARTITION keyword to signify a different type of feature, so do not be confused that they do have this feature with declarative syntax. =================================================================
pgsql-hackers by date: