Re: Declarative partitioning - another take - Mailing list pgsql-hackers
From | Rushabh Lathia |
---|---|
Subject | Re: Declarative partitioning - another take |
Date | |
Msg-id | CAGPqQf0KXQYEb1pwRSkH3bZ9iZWwByXcD8uhcq9=BjdhGN2r1Q@mail.gmail.com Whole thread Raw |
In response to | Re: Declarative partitioning - another take (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>) |
Responses |
Re: Declarative partitioning - another take
|
List | pgsql-hackers |
Hi Amit,
I was just reading through your patches and here are some quick review commentsfor 0001-Catalog-and-DDL-for-partitioned-tables-17.patch.
Review comments for 0001-Catalog-and-DDL-for-partitioned-tables-17.patch:
1)
@@ -1102,9 +1104,10 @@ heap_create_with_catalog(const char *relname,
{
/* Use binary-upgrade override for pg_class.oid/relfilenode? */
if (IsBinaryUpgrade &&
- (relkind == RELKIND_RELATION || relkind == RELKIND_SEQUENCE ||
- relkind == RELKIND_VIEW || relkind == RELKIND_MATVIEW ||
- relkind == RELKIND_COMPOSITE_TYPE || relkind == RELKIND_FOREIGN_TABLE))
+ (relkind == RELKIND_RELATION || relkind == RELKIND_PARTITIONED_TABLE ||
+ relkind == RELKIND_SEQUENCE || relkind == RELKIND_VIEW ||
+ relkind == RELKIND_MATVIEW || relkind == RELKIND_COMPOSITE_TYPE ||
+ relkind == RELKIND_FOREIGN_TABLE))
You should add the RELKIND_PARTITIONED_TABLE at the end of if condition that
will make diff minimal. While reading through the patch I noticed that there
is inconsistency - someplace its been added at the end and at few places its
at the start. I think you can make add it at the end of condition and be
consistent with each place.
2)
+ /*
+ * We need to transform the raw parsetrees corresponding to partition
+ * expressions into executable expression trees. Like column defaults
+ * and CHECK constraints, we could not have done the transformation
+ * earlier.
+ */
Additional space before "Like column defaults".
3)
- char relkind;
+ char relkind,
+ expected_relkind;
Newly added variable should be define separately with its type. Something like:
char relkind;
+ char expected_relkind;
4)
a)
+ /* Prevent partitioned tables from becoming inheritance parents */
+ if (parent_rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("cannot inherit from partitioned table \"%s\"",
+ parent->relname)));
+
need alignment for last line.
b)
+ atttuple = SearchSysCacheAttName(RelationGetRelid(rel), pelem->name);
+ if (!HeapTupleIsValid(atttuple))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("column \"%s\" named in partition key does not exist",
+ pelem->name)));
+ attform = (Form_pg_attribute) GETSTRUCT(atttuple);
+
+ if (attform->attnum <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_COLUMN),
+ errmsg("cannot use system column \"%s\" in partition key",
+ pelem->name)));
need alignment for last line of ereport
c)
+ /* Disallow ROW triggers on partitioned tables */
+ if (stmt->row && rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is a partitioned table",
+ RelationGetRelationName(rel)),
+ errdetail("Partitioned tables cannot have ROW triggers.")));
need alignment
5)
@@ -2512,6 +2579,7 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
cxt.blist = NIL;
cxt.alist = NIL;
cxt.pkey = NULL;
+ cxt.ispartitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
I think adding bracket will look code more clear.
+ cxt.ispartitioned = (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
6)
+ * RelationBuildPartitionKey
+ * Build and attach to relcache partition key data of relation
+ *
+ * Partitioning key data is stored in CacheMemoryContext to ensure it survives
+ * as long as the relcache. To avoid leaking memory in that context in case
+ * of an error partway through this function, we build the structure in the
+ * working context (which must be short-lived) and copy the completed
+ * structure into the cache memory.
extra space before "To avoid leaking memory"
7)
+ /* variable-length fields start here, but we allow direct access to partattrs */
+ int2vector partattrs; /* attribute numbers of columns in the
Why partattrs is allow direct access - its not really clear from the comments.
I will continue reading more patch and testing functionality.. will share the
comments as I have it.
comments as I have it.
Thanks,
On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Updated patches attached. I merged what used to be 0006 and 0007 into one.
On 2016/11/19 2:23, Robert Haas wrote:
> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Oh but wait, that means I can insert rows with NULLs in the range
>> partition key if I choose to insert it directly into the partition,
>> whereas I have been thinking all this while that there could never be
>> NULLs in the partition key of a range partition. What's more,
>> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
>> every partition key column in case of a range partition. Is that
>> wrongheaded altogether? (also see my reply to your earlier message about
>> NULLs in the range partition key)
>
> The easiest thing to do might be to just enforce that all of the
> partition key columns have to be not-null when the range-partitioned
> table is defined, and reject any attempt to DROP NOT NULL on them
> later. That's probably better that shoehorning it into the table
> constraint.
Agreed that forcing range partitioning columns to be NOT NULL during table
creation would be a better approach. But then we would have to reject
using expressions in the range partition key, right?
>> Thanks for the idea below!
>>
>>> 1. Forget the idea of a tree. Instead, let the total number of tables
>>> in the partitioning hierarchy be N and let the number of those that
[ ... ]
>>>
>>> No recursion, minimal pointer chasing, no linked lists. The whole
>>> thing is basically trivial aside from the cost of
>>> list/range_partition_for_tuple itself; optimizing that is a different
>>> project. I might have some details slightly off here, but hopefully
>>> you can see what I'm going for: you want to keep the computation that
>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>> instead set things up in advance so that getting the partition for a
>>> tuple is FAST. And you want the data structures that you are using in
>>> that process to be very compact, hence arrays instead of linked lists.
>>
>> This sounds *much* better. Here is a quick attempt at coding the design
>> you have outlined above in the attached latest set of patches.
>
> That shrank both 0006 and 0007 substantially, and it should be faster,
> too. I bet you can shrink them further:
Some changes described below have reduced the size to a certain degree.
>
> - Why is PartitionKeyExecInfo a separate structure and why does it
> have a NodeTag? I bet you can dump the node tag, merge it into
> PartitionDispatch, and save some more code and some more
> pointer-chasing.
OK, I merged the fields of what used to be PartitionKeyExecInfo into
PartitionDispatchData as the latter's new fields key and keystate.
> - I still think it's a seriously bad idea for list partitioning and
> range partitioning to need different code-paths all over the place
> here. List partitions support nulls but not multi-column partitioning
> keys and range partitions support multi-column partitioning keys but
> not nulls, but you could use an internal structure that supports both.
> Then you wouldn't need partition_list_values_bsearch and also
> partition_rbound_bsearch; you could have one kind of bound structure
> that can be bsearch'd for either list or range. You might even be
> able to unify list_partition_for_tuple and range_partition_for_tuple
> although that looks a little harder. In either case, you bsearch for
> the greatest value <= the value you have. The only difference is that
> for list partitioning, you have to enforce at the end that it is an
> equal value, whereas for range partitioning less-than-or-equal-to is
> enough. But you should still be able to arrange for more code
> sharing.
I have considered these suggestions in the latest patch. Now instead of
PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
there is only one PartitionBoundInfo which consolidates the partition
bound information of a partitioned table. Some of the fields are
applicable only to one of list or range case; for example, null-accepting
list partition index, infinite status of individual range datums.
Also, there is now only one binary search function named
partition_bound_bsearch() which invokes a comparison function named
partition_bound_cmp(). The former searches a probe (a partition bound or
tuple) within a PartitionBoundInfo, which is passed all the way down to
the comparison function.
Also, we no longer have list_partition_for_tuple() and
range_partition_for_tuple(). Instead, in get_partition_for_tuple()
itself, there is a bsearch followed by list and range partitioning
specific steps based on the returned offset.
> - I don't see why you need the bound->lower stuff any more. If
> rangeinfo.bounds[offset] is a lower bound for a partition, then
> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
> partition and the partition is followed by a "gap" or (b) both the
> upper bound for that partition and the lower bound for the next
> partition. With the inclusive/exclusive bound stuff gone, every range
> bound has the same sense: if the probed value is <= the bound then
> we're supposed to be a lower-numbered partition, but if > then we're
> supposed to be in this partition or a higher-numbered one.
OK, I've managed to get rid of lower. At least it is no longer kept in
the new relcache struct PartitionBoundInfo. It is still kept in
PartitionRangeBound which is used to hold individual range bounds when
sorting them (during relcache build). Comparisons invoked during the
aforementioned sorting step still need to distinguish between lower and
upper bounds (such that '1)' < '[1').
Tuple-routing no longer needs to look at lower. In that case, what you
described above applies.
As a result, one change became necessary: to how we flag individual range
bound datum as infinite or not. Previously, it was a regular Boolean
value (either infinite or not) and to distinguish +infinity from
-infinity, we looked at whether the bound is lower or upper (the lower
flag). Now, instead, the variable holding the status of individual range
bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
a bool. Upon encountering an infinite range bound datum, whether it's
negative or positive infinity derives the comparison result. Consider the
following example:
partition p1 from (1, unbounded) to (1, 1);
partition p2 from (1, 1) to (1, 10);
partition p3 from (1, 10) to (1, unbounded);
partition p4 from (2, unbounded) to (2, 1);
... so on
In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
(1, +inf), so that tuple (1, 15) is assigned to the proper partition.
Does this last thing sound reasonable?
Thanks,
Amit
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Rushabh Lathia
pgsql-hackers by date: