Re: Declarative partitioning - another take - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: Declarative partitioning - another take |
Date | |
Msg-id | 795e4902-b70c-a18c-7e88-c40effee0f06@lab.ntt.co.jp Whole thread Raw |
In response to | Re: Declarative partitioning - another take (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: Declarative partitioning - another take
|
List | pgsql-hackers |
On 2016/11/04 9:16, Robert Haas wrote: > Apologies if I've made some of these comments before and/or missed > comments you've made on these topics. The size of this patch set is > so large that it's hard to keep track of everything. Thanks for the reviews! > Re-reviewing 0001: > > + indicate which table columns are used as partition key. For example, > > s/are used as/are part of the/ Fixed. > > + third table columns make up the partition key. A zero in this array > + indicates that the corresponding partition key column is an expression > + over the table columns, rather than a simple column reference. > > I think you can leave out "over the table columns". Done. > + columns or expressions forms the <firstterm>partitioning key</firstterm> > > s/forms/form/ Actually the sentence there is: The parenthesized list of columns or expressions forms the the <firstterm>partitioning key</firstterm> for the table > > + The table itself is empty. A data row inserted into the table is routed > > s/The table/The partitioned table/ Done. > > + * Anything mentioned in the expressions. We must ignore the column > + * references which will count as self-dependency items; in this case, > + * the depender is the table itself (there is no such thing as partition > + * key object). > > "depender" is not really a word, and the parenthetical note isn't very > clear. Maybe: We must ignore the column references, which will depend > on the table itself; there is no separate partition key object. I see "depender" being used as a variable name, but I guess it's not appropriate to use the same in comments. In any case, I adopted your text. > > + heap_close(pg_partitioned_table, RowExclusiveLock); > > It seems like it would be safe to do this right after > CatalogUpdateIndexes(pg_partitioned_table, tuple), and I'd recommend > that you do. Not for performance or anything, but just to keep > related code together. I see, done. > > /* > * Resolve possibly-defaulted operator class specification > */ > -static Oid > +Oid > GetIndexOpClass(List *opclass, Oid attrType, > > Perhaps we should rename this function to ResolveOpClass, since it's > now going to be used for both indexes and partitioning keys. Aha, good idea! Done. > + * Sets *is_expr if attnum is found to be referenced in some partition key > + * expression. > > is_expr doesn't seem quite as clear as, say, used_by_expr or used_in_expr. > > Also, the specification for this function doesn't seem to be very > clear about what this is supposed to do if the same column is both an > explicit partitioning column and also used in an expression, and the > code looks like it'll return with *is_expr set based on whichever use > it encounters first. If that's intended behavior, maybe add a comment > like: It's possible for a column to be used both directly and as part > of a partition key expression; if that happens, *is_expr may end up as > either true or false. That's OK for current uses of this function, > because *is_expr is only used to tailor the error message text. OK, I added the note as you suggest. > > + if (is_expr) > + *is_expr = false; > + if (attnum == partattno) > + return true; > > I think you should adjust this (and the other bit in the same > function) so that you don't set *is_expr until you're committed to > returning. Done. > > + index = -1; > + while ((index = bms_next_member(expr_attrs, index)) > 0) > + { > + AttrNumber attno = index + FirstLowInvalidHeapAttributeNumber; > + > + if (attno == attnum) > + return true; > + } > > How about bms_is_member(expr_attrs, attnum - > FirstLowInvalidHeapAttributeNumber), instead of looping? Done that way, thanks. > + errmsg("cannot reference relation \"%s\"", > RelationGetRelationName(pkrel)), > + errdetail("Referencing partitioned tables in foreign > key constraints is not supported."))); > > I think you could get rid of the errdetail and just have the error > message be "cannot reference partitioned table \"%s\"". OK, done. > > + errmsg("column \"%s\" appears twice in > partition key", pelem->name), > > It could be there three times! How about column \"%s\" appears more > than once in partition key? (I see that you seem to have adapted this > from some code in parse_utilcmd.c, which perhaps should also be > adjusted, but that's a job for another patch.) Done. > > + /* > + * Strip any top-level COLLATE clause. This ensures that we treat > + * "x COLLATE y" and "(x COLLATE y)" alike. > + */ > > But you don't, right? Unless I am confused, which is possible, the > latter COLLATE will be ignored, while the former one will set the > collation to be used in the context of partitioning comparisons. The code immediately following the comment does in fact strip the top-level COLLATE clause. while (IsA(expr, CollateExpr)) expr = (Node *) ((CollateExpr *) expr)->arg; So that the following two specifications are equivalent which is the intent: create table p (a text) partition by range (a collate "en_US"); vs. create table p (a text) partition by range ((a collate "en_US")); > Re-reviewing 0002: > > + if (fout->remoteVersion >= 100000) > + { > + PQExpBuffer acl_subquery = createPQExpBuffer(); > + PQExpBuffer racl_subquery = createPQExpBuffer(); > + PQExpBuffer initacl_subquery = createPQExpBuffer(); > + PQExpBuffer initracl_subquery = createPQExpBuffer(); > + > + PQExpBuffer attacl_subquery = createPQExpBuffer(); > + PQExpBuffer attracl_subquery = createPQExpBuffer(); > + PQExpBuffer attinitacl_subquery = createPQExpBuffer(); > + PQExpBuffer attinitracl_subquery = createPQExpBuffer(); > > It seems unnecessary to repeat all of this. The only differences > between the 10000 version and the 9600 version are: > > 60,61c60 > < "AS changed_acl, " > < "CASE WHEN c.relkind = 'P' THEN > pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef " > --- >> "AS changed_acl " > 73c72 > < "WHERE c.relkind in ('%c', '%c', '%c', '%c', > '%c', '%c', '%c') " > --- >> "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') " > 87,88c86 > < RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE, > < RELKIND_PARTITIONED_TABLE); > --- >> RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE); > > But none of that is really a problem. Sure, the 'P' case will never > arise in 9.6, but so what? I'd really like to not keep duplicating > these increasingly-complex hunks of code if we can find some way to > avoid that. We cannot reference pg_catalog.pg_get_partkeydef() in the SQL query that getTables() sends to pre-10 servers, right? But I suppose we need not call it in that particular SQL query in the first place. How about we do it in the following manner in getSchemaData(): if (g_verbose) write_msg(NULL, "reading partition key information for interesting tables\n"); getTablePartitionKeyInfo(fout, tblinfo, numTables); I have implemented the same. > > /* find all the inheritance information */ > > - appendPQExpBufferStr(query, "SELECT inhrelid, inhparent FROM > pg_inherits"); > + appendPQExpBufferStr(query, > + "SELECT inhrelid, inhparent " > + "FROM pg_inherits " > + "WHERE inhparent NOT > IN (SELECT oid FROM pg_class WHERE relkind = 'P')"); > > I think you need to update the comment. "Find inheritance > information, excluding implicit inheritance via partitioning. We're > not interested in that case because $REASON." I just realized that this hunk really belongs in patch 0004. In any case, I added the explanatory comment like you suggest. > Re-reviewing 0003: > > + <para> > + If this table is a partition, one cannot perform <literal>DROP > NOT NULL</> > + on a column if it is marked not null in the parent table. > + </para> > > This would, presumably, also be true for inheritance. I think we > could just leave this out. Actually, it isn't true for the regular inheritance situation. create table parent (a int not null); create table child () inherits (parent); alter table child alter a drop not null; -- this works (bug?) vs. create table p (a int, b int, c int) partition by range (a); create table p1 partition of p for values start (1) end (10); alter table p1 alter a drop not null; -- this causes error Once NOT NULL constraints start using pg_constraint [1] to keep track of inheritance (which currently is not kept track of), the above illustrated bug will be fixed. If that patch gets committed, it will have taken care of the partitioning case as well. But in the meantime, we can proceed with enforcing inheritance on NOT NULL constraint for *partitions*, because they only ever have one parent and hence do not need elaborate coninhcount based scheme, I think. Thoughts? > > + as partition of the target table. The partition bound specification must > > s/as partition/as a partition/ Fixed. > > + correspond to the partitioning method and partitioning key of the target > > I think that in most places were are referring to the "partitioning > method" (with ing) but the "partition key" (without ing). Let's try to > be consistent. I'm inclined to switch to "partitioning method" and "partitioning key", but do you mean just the documentation or throughout? Beside documentation, I mean source code comments, error messages, etc. I have assumed throughout. > > + table. The table to be attached must have all the columns as the target > + table and no more; moreover, the column types must also match. Also, it > + must have all the matching constraints as the target table. > > s/all the columns/all of the same columns/ Fixed. > The second sentence doesn't seem quite grammatical. And why would > that be true anyway? Partitions can have extra constraints, and if > they lack constraints that are present on the partitioned table, those > constraints will be added and validated, right? It's true that partitions can have extra constraints which don't affect the discussion. What I meant to say with the second sentence is that any check constraints defined on the parent table (being referred to as the target table) must be present in the table being attached. It's the same rule as for regular inheritance. I didn't change how that works in the partitioning case. So no, check constraints of the parent table that are missing in the table being attached are not added and validated as part of the attach partition command processing. I did change the second sentence to: Also, it must have all the <literal>NOT NULL</literal> and <literal>CHECK</literal> constraints present in the target table. > > + A full table scan is performed on the table being attached to check that > + existing rows in the table are within the specified partition bound. > + If it is known in advance that no partition bound violating rows are > + present in the table, the above scan can be skipped by specifying the > + <literal>NO VALIDATE</> option. > + Default behavior is to perform the scan, as if the > <literal>VALIDATE</literal> > + option were specified. > > I don't think it's OK to allow the partition to be added if it > contains rows that might not be valid. We are generally vary wary > about adding options that can cause data integrity failures and I > think we shouldn't start here, either. On the other hand, it's also > not desirable for adding a partition to take O(n) time in all cases. > So what would be nice is if the new partition could self-certify that > contains no problematic rows by having a constraint that matches the > new partitioning constraint. Then we can skip the scan if that > constraint is already present. I agree that NO VALIDATE is undesirable and it would be better if we could get rid of the same. I want to clarify one thing though: what does it mean for the new partition to have a constraint that *matches* the new partitioning constraint? Does it mean the new partition's constraint *implies* the partitioning constraint? Or as certified by equal()? > > + inherited columns. One can also specify table constraints, in addition > > Delete comma. Done. > > + to those inherited from the parent. If a check constraint with the name > + matching one of the parent's constraint is specified, it is merged with > + the latter, provided the specified condition is same. > > Doesn't that effectively delete the merged constraint? > > Suggest: "If the parent already has a check constraint with the same > name as a constraint specified for the child, the conditions must be > the same." > > +) FOR VALUES IN ('los angeles', 'san fransisco'); > > That's not you you spell San Francisco. Oops. I changed the example such that specified values are properly capitalized. <programlisting> CREATE TABLE cities ( name text not null, population int, ) PARTITION BY LIST (initcap(name)); </programlisting></para> <programlisting> CREATE TABLE cities_west PARTITION OF cities ( CONSTRAINT city_id_nonzero CHECK (city_id != 0) ) FOR VALUES IN ('Los Angeles', 'San Francisco'); </programlisting></para> Previously, the key was: PARTITION BY LIST (lower(name)) > + Create partition of a list partitioned table that itself is further > + partitioned and then create its partition: > > s/itself is/is itself/ > s/then create its partition/then add a partition to it/ Fixed. > > + if (!is_local || con->coninhcount == 0) > + con->coninhcount++; > > I would think that you could get rid of the "if" and just say > con->coninhcount = 1. That's better. So: /* * In case of partitions, an inherited constraint must be * inherited only once since it cannot have multiple parents and * it is never considered local. */ if (rel->rd_rel->relispartition) { con->coninhcount = 1; con->conislocal = false; } The above is enforced in both MergeWithExistingConstraint() and MergeConstraintsIntoExisting(). > It seems to me (and perhaps the comment could > say this) that for a partitioned table, we can simplify the handling > of coninhcount and conislocal. Since partitioning hierarchies don't > allow multiple inheritance, any inherited constraint must be inherited > exactly once. Since a partition cannot have inheritance children -- > except by being partitioned itself -- there is no point in tracking > conislocal, so we always store it as false. Agreed. > + > +void > +StorePartitionBound(Relation rel, Node *bound) > > Header comment, please! Sorry about that, added. > > + (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, > + &isnull); > + Assert(isnull); > > We try not to do unnecessary work in non-assert-enabled builds solely > for the benefit of assert-enabled builds. We also try not to end up > with variables that are only used in assert-enabled builds but not > marked PG_USED_FOR_ASSERTS_ONLY, because those tend to cause compiler > warnings. I'm not sure an compiler would be smart enough to warn > about this, but I suggest adding an #ifdef USE_ASSERT_CHECKING with a > block inside where the offending variable is declared. Actually, I > think you need to move a bit more code. Hmm. Something like: > > #ifdef USE_ASSERT_CHECKING > { > Form_pg_class classForm; > bool isnull; > > classForm = (Form_pg_class) GETSTRUCT(tuple); > Assert(!classForm->relispartition); > > (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound, > &isnull); > Assert(isnull); > } > #endif Done this way, thanks. > > + * are same in common cases, of which we only store one. > > "and we only store one of them." Done. > > + /* > + * Never put a > null into the values array, flag > + * instead for > the code further down below where > + * we > construct the actual relcache struct. > + */ > + > found_null_partition = true; > + > null_partition_index = i; > > How about we elog(ERROR, ...) if found_null_partition is already set? Makes sense. However, let me mention here that duplicates either within one partition's list or across the partitions are not possible. That's because in case of the former, we de-duplicate before storing the list into the catalog and the latter would simply be an overlap error. Could this be made an Assert() then? > > + foreach(cell, non_null_values) > + { > + PartitionListValue *src = > lfirst(cell); > + > + all_values[i] = (PartitionListValue *) > + > palloc(sizeof(PartitionListValue)); > + all_values[i]->value = > datumCopy(src->value, > + > key->parttypbyval[0], > + > key->parttyplen[0]); > + all_values[i]->index = src->index; > + i++; > + } > > Why do we need to datumCopy() here when we just did that in the > previous loop? Why did the previous loop need to datumCopy(), anyway? > Can't we just pass the same datums around? I understand that you > need to copy the data into the correct context at the end, but doing > two copies prior to that seems excessive. Agreed. Now the datumCopying() happens only once when copying to the relcache context. > +get_leaf_partition_oids(Oid relid, int lockmode) > > Instead of implementing this, can you just use find_all_inheritors()? OK, got rid of get_leaf_partition_oids(). > > + /* > + * Is lower_val = upper_val? > + */ > + if (lower_val && upper_val) > > So, that comment does not actually match that code. That if-test is > just checking whether both bounds are finite. What I think you should > be explaining here is that if lower_val and upper_val are both > non-infinite, and if the happen to be equal, we want to emit an > equality test rather than a >= test plus a <= test because $REASON. OK, I have added some explanatory comments around this code. > > + operoid = get_opfamily_member(key->partopfamily[col], > + > key->parttypid[col], > + > key->parttypid[col], > + strategy); > + if (!OidIsValid(operoid)) > + { > + operoid = get_opfamily_member(key->partopfamily[col], > + > key->partopcintype[col], > + > key->partopcintype[col], > + > strategy); > + *need_relabel = true; > + } > > There's a comment explaining THAT you do this ("Use either the column > type as the operator datatype or opclass's declared input type.") but, > to repeat a complaint that I've made often, nothing explaining why. > In this particular case, what's not clear - in my opinion - is why you > need to try two different possibilities, and why at least one of those > possibilities is guaranteed to work. I gather that if the opfamily > doesn't contain an operator for the actual type of the partitioning > column, you think it will certainly contain one for the input type of > the operator class (which seems right), and that the input type of the > operator class will be binary-compatible with the type of the > partitioning column (which is less-obviously true, and needs some > explanation). Sorry, it is indeed not clear why the code is the way it it. I expanded the comment explaining why (borrowing from what you wrote above). > > I also think that this function should elog(ERROR, ...) if by any > chance the second get_opfamily_member() call also fails. Otherwise the > error might occur quite a bit downstream and be hard to understand. OK, added an elog(ERROR, ...) after the second get_opfamily_member() call. > > + ReleaseSysCache(tuple); > + heap_close(parent, AccessShareLock); > > I think you had better not release the lock here - i.e. pass NoLock. > We don't release locks on tables until transaction commit, except for > catalog tables. This also comes up in, at least, > transformPartitionOf(). Ah, fixed. > + PartitionRangeBound *b1 = (*(PartitionRangeBound *const *) a); > + PartitionRangeBound *b2 = (*(PartitionRangeBound *const *) b); > + PartitionKey key = (PartitionKey) arg; > + > + return partition_rbound_cmp(key, b1, b2); > > Whitespace. Fixed. > > + * as partition and schema consists of columns definitions corresponding > > the schema consists > > - if (recurse) > + /* Force inheritance recursion, if partitioned table. */ > + if (recurse || rel->rd_rel->relkind == > RELKIND_PARTITIONED_TABLE) > > I would instead error out if ONLY is specified for a partitioned table. I forgot this instance of forced-recursion-if-partitioned, fixed. I was not quite sure what the error message would say; how about: ereport(ERROR, (errcode(ERRCODE_WRONG_OBJECT_TYPE), errmsg("must truncate child tables too"))); > + /* > + * If the table is source table of ATTACH PARTITION > command, following > + * check is unnecessary. > + */ > > As usual, comment should say WHY. OK, expanded the comment explaining why. > > + if (partqualstate && !ExecQual(partqualstate, > econtext, true)) > + ereport(ERROR, > + > (errcode(ERRCODE_CHECK_VIOLATION), > + errmsg("child table > contains a row violating partition bound specification"))); > > Why not mimic the existing phrasing? "partition constraint is > violated by some row" Agreed, done. > > What happens if you try to attach a table as a partition of itself or > one of its ancestors? The latter fails with "already a partition" error. The former case was not being handled at all which has now been fixed. ATExecAddInherit() prevents that case as an instance of preventing the circularity of inheritance. It says: ERROR: circular inheritance not allowed. And then: DETAIL: "rel" is already a child of "rel". Should ATExecAttachPartition() use the same trick and keep using the same message(s)? The above detail message doesn't quite sound appropriate when one tries to attach a table as partition of itself. > > - errmsg("column \"%s\" in child table > must be marked NOT NULL", > - attributeName))); > + errmsg("column \"%s\" > in child table must be marked NOT NULL", > + > attributeName))); > > Whitespace-only hunk; revert. You cannot fight the power of pgindent. Oops, fixed. > > + errmsg("cannot attach table that is a > inheritance child as partition"))); > > an inheritance child > > + errmsg("cannot attach a temporary relation of another > session as partition "))); > > Extra space. Fixed. > > + errdetail("Table being > attached should contain only the columns" > + " present > in parent."))); > > Suggest: "New partition should contain only..." > > Also, don't break error messages into multiple strings. Make it one > long string and let pgindent deal. Done. > + | FOR VALUES START '(' range_datum_list ')' lb_inc > + END_P '(' > range_datum_list ')' ub_inc > > Just a random idea. Would for VALUES FROM ( ... ) TO ( ... ) be more > idiomatic than START and END? It would actually. Should I go ahead and change to FROM (...) TO (...)? Related to range partitioning, should we finalize on inclusive START/FROM and exclusive END/TO preventing explicit specification of the inclusivity? > +static void > +transformPartitionOf(CreateStmtContext *cxt, Node *bound) > +{ > + TupleDesc tupdesc; > + int i; > + RangeVar *part = cxt->relation; > + RangeVar *partof = linitial(cxt->inhRelations); > + Relation parentRel; > + > + parentRel = heap_openrv(partof, AccessShareLock); > + > > If there's anyway that we might do heap_openrv() on the same RangeVar > at multiple places in the code, it presents security and integrity > hazards because the referent of that RangeVar might change in the > meantime. I suspect there is such a risk here - won't we need to open > the relation again when we actually want to do the operation? > > Also, we should never acquire a lower-level lock on a relation and > then, further downstream, acquire a higher-level lock on the same > object. To do so creates a deadlock hazard. That seems like it might > be a problem here, too. I think I have modified things so that the concerns you express are taken care of. In particular, I got rid of transformPartitionOf() altogether, moving part of its responsibilities to MergeAttributes() where the parent table is locked anyway and calling transformPartitionBound from DefineRelation() (we needed access to the parent's PartitionKey for that). > > + /*if (ldatum->infinite && rdatum->infinite) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_TABLE_DEFINITION), > + errmsg("both START > and END datum for column \"%s\" cannot be UNBOUNDED", > + colname), > + > parser_errposition(cxt->pstate, rdatum->location)));*/ > > Commented-out code is bad. Oops, removed. > > + if (list_length(spec->lowerdatums) > partnatts) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_TABLE_DEFINITION), > + errmsg("START has more values > than the length of the partition key"), > + > parser_errposition(cxt->pstate, > + > exprLocation(list_nth(spec->lowerdatums, > + > list_length(spec->lowerdatums) - 1))))); > + else if (list_length(spec->lowerdatums) < partnatts) > + ereport(ERROR, > + > (errcode(ERRCODE_INVALID_TABLE_DEFINITION), > + errmsg("START has fewer > values than the length of the partition key"), > + > parser_errposition(cxt->pstate, > + > exprLocation(list_nth(spec->lowerdatums, > + > list_length(spec->lowerdatums) - 1))))); > > It would be worth looking for a way to unify these cases. Like "START > must specify exactly one value per partitioning column". Agreed. The code is trying to generate too specific error messages. > + * Same oids? No mind order - in the list case, it > matches the order > + * in which partition oids are returned by a > pg_inherits scan, whereas > + * in the range case, they are in order of ranges of individual > + * partitions. XXX - is the former unsafe? > > Probably. It likely depends on the physical ordering of tuples in the > table, which can change. This comment is outdated. RelationBuildPartitionDesc() now always puts the OIDs into the array in a canonical order (in both list and range cases) using. I have updated the comment. > > + * BoundCollection encapsulates a set of partition bounds of either physical > + * or logical relations. It is associated with a partitioned relation of > + * which the aforementioned relations are partitions. > > "physical or logical relations" is unfamiliar terminology. What I am trying to say there is that one can associate a BoundCollection with either an actual partitioned table or with a transient partitioned relation, say, a partitioned joinrel. In case of the former, the BoundCollection is part of the table's partition descriptor. I can see though that "logical/physical relations" terminology is misleading at best. I rewrote the comment: /* * BoundCollection encapsulates a set of partition bounds. It is usually * associated with partitioned tables as part of its partition descriptor. * * The internal structure is opaque outside partition.c. */ typedef struct BoundCollectionData *BoundCollection; Attached updated patches take care of the above comments and few other fixes. There are still a few items I have not addressed right away: - Remove NO VALIDATE clause in ATTACH PARTITION and instead rely on the new partition's constraints to skip the validation scan - Remove the syntax to specify inclusivity of each of START and END bounds of range partitions and instead assume inclusive START and exclusive END Also, what used to be "[PATCH 5/9] Refactor optimizer's inheritance set expansion code" is no longer included in the set. I think we can go back to that topic later as I mentioned last week [2]. Thanks, Amit [1] https://www.postgresql.org/message-id/16589.1465997664%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/0a299fbf-e5a1-dc3d-eb5e-11d3601eac16%40lab.ntt.co.jp
Attachment
- 0001-Catalog-and-DDL-for-partitioned-tables-11.patch
- 0002-psql-and-pg_dump-support-for-partitioned-tables-11.patch
- 0003-Catalog-and-DDL-for-partitions-11.patch
- 0004-psql-and-pg_dump-support-for-partitions-11.patch
- 0005-Teach-a-few-places-to-use-partition-check-quals-11.patch
- 0006-Introduce-a-PartitionTreeNode-data-structure-11.patch
- 0007-Tuple-routing-for-partitioned-tables-11.patch
- 0008-Update-DDL-Partitioning-chapter-to-reflect-new-devel-11.patch
pgsql-hackers by date: