Re: Declarative partitioning - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Declarative partitioning
Date
Msg-id CAFjFpRcZoxr+F-OjHu77gj2EGUCJBwibNUc655n-_2afi2XBnw@mail.gmail.com
Whole thread Raw
In response to Re: Declarative partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: Declarative partitioning  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
List pgsql-hackers
FOR VALUE clause of a partition does not allow a constant expression like (10000/5 -1). It gives syntax error
regression=# create table pt1_p1 partition of pt1 for values start (0) end ((10000/5) - 1);
ERROR:  syntax error at or near "("
LINE 1: ...pt1_p1 partition of pt1 for values start (0) end ((10000/5) ...

Shouldn't we allow constant expressions here?

If this has been already discussed, please forgive me and point out the relevant mail chain.

On Tue, Aug 9, 2016 at 12:48 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned table, I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see that the code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like PartitionKeyExecInfo. I don't have any specific observation as to why we need such information in RelOptInfo, but lack of it makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your patch re-uses inheritance code. I am not sure.

On Tue, Aug 9, 2016 at 9:17 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2016/08/09 6:02, Robert Haas wrote:
> On Mon, Aug 8, 2016 at 1:40 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> +1, if we could do it. It will need a change in the way Amit's patch stores
>>> partitioning scheme in PartitionDesc.
>>
>> Okay, I will try to implement this in the next version of the patch.
>>
>> One thing that comes to mind is what if a user wants to apply hash
>> operator class equality to list partitioned key by specifying a hash
>> operator class for the corresponding column.  In that case, we would not
>> have the ordering procedure with an hash operator class, hence any
>> ordering based optimization becomes impossible to implement.  The current
>> patch rejects a column for partition key if its type does not have a btree
>> operator class for both range and list methods, so this issue doesn't
>> exist, however it could be seen as a limitation.
>
> Yes, I think you should expect careful scrutiny of that issue.  It
> seems clear to me that range partitioning requires a btree opclass,
> that hash partitioning requires a hash opclass, and that list
> partitioning requires at least one of those things.  It would probably
> be reasonable to pick one or the other and insist that list
> partitioning always requires exactly that, but I can't see how it's
> reasonable to insist that you must have both types of opclass for any
> type of partitioning.

So because we intend to implement optimizations for list partition
metadata that presuppose existence of corresponding btree operator class,
we should just always require user to specify one (or error out if user
doesn't specify and a default one doesn't exist).  That way, we explicitly
do not support specify hash equality operator for list partitioning.

>> So, we have 3 choices for the internal representation of list partitions:
>>
>> Choice 1 (the current approach):  Load them in the same order as they are
>> found in the partition catalog:
>>
>> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3 {'a', 'e'}
>> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}
>>
>> In this case, mismatch on the first list would make the two tables
>> incompatibly partitioned, whereas they really aren't incompatible.
>
> Such a limitation seems clearly unacceptable.  We absolutely must be
> able to match up compatible partitioning schemes without getting
> confused by little details like the order of the partitions.

Agreed.  Will change my patch to adopt the below method.

>> Choice 2: Representation with 2 arrays:
>>
>> Table 1: ['a', 'b', 'c', 'd', 'e', 'f'], [3, 1, 2, 2, 3, 1]
>> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]
>>
>> It still doesn't help the case of pairwise joins because it's hard to tell
>> which value belongs to which partition (the 2nd array carries the original
>> partition numbers).  Although it might still work for tuple-routing.
>
> It's very good for tuple routing.  It can also be used to match up
> partitions for pairwise joins.  Compare the first arrays.  If they are
> unequal, stop.  Else, compare the second arrays, incrementally
> building a mapping between them and returning false if the mapping
> turns out to be non-bijective.  For example, in this case, we look at
> index 0 and decide that 3 -> 2.  We look at index 1 and decide 1 -> 3.
> We look at index 2 and decide 2 -> 1.  We look at index 4 and find
> that we already have a mapping for 2, but it's compatible because we
> need 2 -> 1 and that's what is already there.  Similarly for the
> remaining entries.  This is really a pretty easy loop to write and it
> should run very quickly.

I see, it does make sense to try to implement this way.

>> Choice 3: Order all lists' elements for each list individually and then
>> order the lists themselves on their first values:
>>
>> Table 1: p3 {'a', 'e'}, p2 {'b', 'f'}, p1 {'c', 'd'}
>> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}
>>
>> This representation makes pairing partitions for pairwise joining
>> convenient but for tuple-routing we still need to visit each partition in
>> the worst case.
>
> I think this is clearly not good enough for tuple routing.  If the
> algorithm I proposed above turns out to be too slow for matching
> partitions, then we could keep both this representation and the
> previous one.  We are not limited to just one.  But I don't think
> that's likely to be the case.

I agree.  Let's see how the option 2 turns out.

> Also, note that all of this presupposes we're doing range
> partitioning, or perhaps list partitioning with a btree opclass.  For
> partitioning based on a hash opclass, you'd organize the data based on
> the hash values rather than range comparisons.

Yes, the current patch does not implement hash partitioning, although I
have to think about how to support the hash case when designing the
internal data structures.


By the way, I am planning to start a new thread with the latest set of
patches which I will post in a day or two.  I have tried to implement all
the bug fixes and improvements that have been suggested on this thread so
far.  Thanks to all those who reviewed and gave their comments.  Please
check this page to get a link to the new thread:
https://commitfest.postgresql.org/10/611/

Thanks,
Amit





--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

pgsql-hackers by date:

Previous
From: Ants Aasma
Date:
Subject: Re: Proposal for CSN based snapshots
Next
From: Amit Langote
Date:
Subject: Re: Declarative partitioning