Re: Declarative partitioning - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Declarative partitioning
Date
Msg-id a616514e-c7a0-ea14-a877-077542f88aae@lab.ntt.co.jp
Whole thread Raw
In response to Re: Declarative partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: Declarative partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Re: Declarative partitioning  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2016/08/05 21:38, Ashutosh Bapat wrote:
>>> Consider lists ('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b', 'a') for a
>>> list partitioned tables. I am suggesting that we arrange them as
>>> ('a','b','l'), ('d', 'h', 'm') and ('e', 'f', 'i'). If the given row
>> (either
>>> for comparison or for inserting) has value 'c', we will search for it in
>>> ('a','b','l') but will be eliminate other two partitions since the
>> second's
>>> partition's lowest value is higher than 'c' and lowest values of rest of
>> the
>>> partitions are higher than that of the second partition. Without this
>> order
>>> among the partitions, we will have to compare lowest values of all the
>>> partitions.
>>
>> I would think that for that case what we'd want to do is create two
>> lists.  One looks like this:
>>
>> [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
>>
>> The other looks like this:
>>
>> [3, 3, 2, 1, 1, 2, 1, 1, 3, 2]
>>
> 
> small correction; there's an extra 1 here. Every partition in the example
> has only three values.
> 
> 
>>
>> Given a particular value, bsearch the first list.  If the value is not
>> found, it's not in any partition.  If it is found, then go look up the
>> same array index in the second list; that's the containing partition.
>>
> 
> +1, if we could do it. It will need a change in the way Amit's patch stores
> partitioning scheme in PartitionDesc.

Okay, I will try to implement this in the next version of the patch.

One thing that comes to mind is what if a user wants to apply hash
operator class equality to list partitioned key by specifying a hash
operator class for the corresponding column.  In that case, we would not
have the ordering procedure with an hash operator class, hence any
ordering based optimization becomes impossible to implement.  The current
patch rejects a column for partition key if its type does not have a btree
operator class for both range and list methods, so this issue doesn't
exist, however it could be seen as a limitation.

> This way specifications {('e', 'i', 'f'), ('h', 'd','m') and ('l', 'b',
> 'a')} and {('h', 'd','m') , ('e', 'i', 'f'), and ('l', 'b', 'a')} which are
> essentially same, are represented in different ways. It makes matching
> partitions for partition-wise join a bit tedius. We have to make sure that
> the first array matches for both the joining relations and then make sure
> that all the values belonging to the same partition for one table also
> belong to the same partition in the other table. Some more complex logic
> for matching subsets of lists for partition-wise join.

So, we have 3 choices for the internal representation of list partitions:

Choice 1 (the current approach):  Load them in the same order as they are
found in the partition catalog:

Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}

In this case, mismatch on the first list would make the two tables
incompatibly partitioned, whereas they really aren't incompatible.

Choice 2: Representation with 2 arrays:

Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]

It still doesn't help the case of pairwise joins because it's hard to tell
which value belongs to which partition (the 2nd array carries the original
partition numbers).  Although it might still work for tuple-routing.

Choice 3: Order all lists' elements for each list individually and then
order the lists themselves on their first values:

Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}

This representation makes pairing partitions for pairwise joining
convenient but for tuple-routing we still need to visit each partition in
the worst case.

> At least for straight forward partitioned table matching it helps to have
> both these array look same independent of the user specification. From that
> point of view, the partition be ordered by their lowest or highest list
> values and the second array is the index in the ordered set. For both the
> specifications above, the list will look like
> 
> [ 'a', 'b', 'd', 'e', f', 'h', 'i', 'l', 'm' ]
> [1, 1, 2, 3, 3, 2, 3, 1, 2]

IIUC, this seems like a combination of 2 and 3 above:

So, we have the following list partitions (as read from the catalog)

Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}

By applying the method of 3:

Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}

Then applying 2:

Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [3, 1, 2]
Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [1, 2, 3, 3, 1, 2], [2, 3, 1]

This is user-specification independent representation wherein the
partition numbers in the 2nd array are based on canonical representation
(ordered lists).  To check pairwise join compatibility, simply compare the
first two arrays.  As to which partitions (think OIDs, RTEs whatever) pair
with each other, simply pair corresponding elements of the 3rd array which
are original partitions numbers (or OIDs).  Also when routing a tuple, we
find partition number in the array 2 and then look up the array 3 to get
the actual partition to insert the tuple.

Neither of these representations make the logic of checking pairwise-join
compatibility and pairing a subset of partitions (on either side) any
easier, although I haven't given it a good thought yet.

Thanks,
Amit





pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: No longer possible to query catalogs for index capabilities?
Next
From: Ashutosh Bapat
Date:
Subject: Re: Declarative partitioning