Re: Declarative partitioning - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Declarative partitioning
Date
Msg-id CAFjFpRdvAPAoH+wUF+f4jfui3pKGM6i78DHBT3GTfoUO4Aj=+g@mail.gmail.com
Whole thread Raw
In response to Re: Declarative partitioning  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: Declarative partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
<div dir="ltr">What strikes me odd about these patches is RelOptInfo has remained unmodified. For a base partitioned
table,I would expect it to be marked as partitioned may be indicating the partitioning scheme. Instead of that, I see
thatthe code directly deals with PartitionDesc, PartitionKey which are part of Relation. It uses other structures like
PartitionKeyExecInfo.I don't have any specific observation as to why we need such information in RelOptInfo, but lack
ofit makes me uncomfortable. It could be because inheritance code does not require any mark in RelOptInfo and your
patchre-uses inheritance code. I am not sure.<br /></div><div class="gmail_extra"><br /><div class="gmail_quote">On
Tue,Aug 9, 2016 at 9:17 AM, Amit Langote <span dir="ltr"><<a href="mailto:Langote_Amit_f8@lab.ntt.co.jp"
target="_blank">Langote_Amit_f8@lab.ntt.co.jp</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0
00 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 2016/08/09 6:02, Robert Haas wrote:<br /> >
OnMon, Aug 8, 2016 at 1:40 AM, Amit Langote<br /> > <<a
href="mailto:Langote_Amit_f8@lab.ntt.co.jp">Langote_Amit_f8@lab.ntt.co.jp</a><wbr/>> wrote:<br /> >>> +1,
ifwe could do it. It will need a change in the way Amit's patch stores<br /> >>> partitioning scheme in
PartitionDesc.<br/> >><br /> >> Okay, I will try to implement this in the next version of the patch.<br />
>><br/> >> One thing that comes to mind is what if a user wants to apply hash<br /> >> operator class
equalityto list partitioned key by specifying a hash<br /> >> operator class for the corresponding column.  In
thatcase, we would not<br /> >> have the ordering procedure with an hash operator class, hence any<br /> >>
orderingbased optimization becomes impossible to implement.  The current<br /> >> patch rejects a column for
partitionkey if its type does not have a btree<br /> >> operator class for both range and list methods, so this
issuedoesn't<br /> >> exist, however it could be seen as a limitation.<br /> ><br /> > Yes, I think you
shouldexpect careful scrutiny of that issue.  It<br /> > seems clear to me that range partitioning requires a btree
opclass,<br/> > that hash partitioning requires a hash opclass, and that list<br /> > partitioning requires at
leastone of those things.  It would probably<br /> > be reasonable to pick one or the other and insist that list<br
/>> partitioning always requires exactly that, but I can't see how it's<br /> > reasonable to insist that you
musthave both types of opclass for any<br /> > type of partitioning.<br /><br /></span>So because we intend to
implementoptimizations for list partition<br /> metadata that presuppose existence of corresponding btree operator
class,<br/> we should just always require user to specify one (or error out if user<br /> doesn't specify and a default
onedoesn't exist).  That way, we explicitly<br /> do not support specify hash equality operator for list
partitioning.<br/><span class=""><br /> >> So, we have 3 choices for the internal representation of list
partitions:<br/> >><br /> >> Choice 1 (the current approach):  Load them in the same order as they are<br
/>>> found in the partition catalog:<br /> >><br /> >> Table 1: p1 {'b', 'f'}, p2 {'c', 'd'}, p3
{'a','e'}<br /> >> Table 2: p1 {'c', 'd'}, p2 {'a', 'e'}, p3 {'b', 'f'}<br /> >><br /> >> In this
case,mismatch on the first list would make the two tables<br /> >> incompatibly partitioned, whereas they really
aren'tincompatible.<br /> ><br /> > Such a limitation seems clearly unacceptable.  We absolutely must be<br />
>able to match up compatible partitioning schemes without getting<br /> > confused by little details like the
orderof the partitions.<br /><br /></span>Agreed.  Will change my patch to adopt the below method.<br /><span
class=""><br/> >> Choice 2: Representation with 2 arrays:<br /> >><br /> >> Table 1: ['a', 'b', 'c',
'd','e', 'f'], [3, 1, 2, 2, 3, 1]<br /> >> Table 2: ['a', 'b', 'c', 'd', 'e', 'f'], [2, 3, 1, 1, 2, 3]<br />
>><br/> >> It still doesn't help the case of pairwise joins because it's hard to tell<br /> >> which
valuebelongs to which partition (the 2nd array carries the original<br /> >> partition numbers).  Although it
mightstill work for tuple-routing.<br /> ><br /> > It's very good for tuple routing.  It can also be used to
matchup<br /> > partitions for pairwise joins.  Compare the first arrays.  If they are<br /> > unequal, stop. 
Else,compare the second arrays, incrementally<br /> > building a mapping between them and returning false if the
mapping<br/> > turns out to be non-bijective.  For example, in this case, we look at<br /> > index 0 and decide
that3 -> 2.  We look at index 1 and decide 1 -> 3.<br /> > We look at index 2 and decide 2 -> 1.  We look
atindex 4 and find<br /> > that we already have a mapping for 2, but it's compatible because we<br /> > need 2
->1 and that's what is already there.  Similarly for the<br /> > remaining entries.  This is really a pretty easy
loopto write and it<br /> > should run very quickly.<br /><br /></span>I see, it does make sense to try to implement
thisway.<br /><span class=""><br /> >> Choice 3: Order all lists' elements for each list individually and then<br
/>>> order the lists themselves on their first values:<br /> >><br /> >> Table 1: p3 {'a', 'e'}, p2
{'b','f'}, p1 {'c', 'd'}<br /> >> Table 2: p2 {'a', 'e'}, p1 {'b', 'f'}, p3 {'c', 'd'}<br /> >><br />
>>This representation makes pairing partitions for pairwise joining<br /> >> convenient but for
tuple-routingwe still need to visit each partition in<br /> >> the worst case.<br /> ><br /> > I think this
isclearly not good enough for tuple routing.  If the<br /> > algorithm I proposed above turns out to be too slow for
matching<br/> > partitions, then we could keep both this representation and the<br /> > previous one.  We are not
limitedto just one.  But I don't think<br /> > that's likely to be the case.<br /><br /></span>I agree.  Let's see
howthe option 2 turns out.<br /><span class=""><br /> > Also, note that all of this presupposes we're doing range<br
/>> partitioning, or perhaps list partitioning with a btree opclass.  For<br /> > partitioning based on a hash
opclass,you'd organize the data based on<br /> > the hash values rather than range comparisons.<br /><br
/></span>Yes,the current patch does not implement hash partitioning, although I<br /> have to think about how to
supportthe hash case when designing the<br /> internal data structures.<br /><br /><br /> By the way, I am planning to
starta new thread with the latest set of<br /> patches which I will post in a day or two.  I have tried to implement
all<br/> the bug fixes and improvements that have been suggested on this thread so<br /> far.  Thanks to all those who
reviewedand gave their comments.  Please<br /> check this page to get a link to the new thread:<br /><a
href="https://commitfest.postgresql.org/10/611/"rel="noreferrer" target="_blank">https://commitfest.postgresql.<wbr
/>org/10/611/</a><br/><br /> Thanks,<br /> Amit<br /><br /><br /></blockquote></div><br /><br clear="all" /><br />--
<br/><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Best Wishes,<br />Ashutosh Bapat<br
/>EnterpriseDBCorporation<br />The Postgres Database Company<br /></div></div></div> 

pgsql-hackers by date:

Previous
From: Sridhar N Bamandlapally
Date:
Subject: Re: Surprising behaviour of \set AUTOCOMMIT ON
Next
From: Masahiko Sawada
Date:
Subject: Re: Logical Replication WIP