Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From amul sul
Subject Re: [HACKERS] [POC] hash partitioning
Date
Msg-id CAAJ_b97mTb=dG2pv6+1ougxEVZFVnZJajW+0QHj46mEE7WsoOQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] [POC] hash partitioning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] [POC] hash partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Re: [HACKERS] [POC] hash partitioning  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
Hi,

Please find the following updated patches attached:

0001-Cleanup.patch : Does some cleanup and code refactoring required
for hash partition patch. Otherwise, there will be unnecessary diff in
0002 patch

0002-hash-partitioning_another_design-v3.patch: Addressed review
comments given by Ashutosh and Robert.

On Wed, May 10, 2017 at 11:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote:
>> Fixed in the attached version.
>
> +[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable
> class="parameter">column_name</replaceable> | ( <replaceable
> class="parameter">expression</replaceable> ) } [ COLLATE <replaceable
>
> In the department of severe nitpicking, I would have expected this to
> either use alphabetical order (HASH | LIST | RANGE) or to add the new
> method at the end on the theory that we probably did the important
> ones first (RANGE | LIST | HASH).
>
Fixed in the attached version.

> +  WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>,
> REMAINDER <replaceable class="PARAMETER">value</replaceable> ) }
>
> Maybe value -> modulus and value -> remainder?
>
Fixed in the attached version.

>       <para>
> +      When creating a hash partition, <literal>MODULUS</literal> should be
> +      greater than zero and <literal>REMAINDER</literal> should be greater than
> +      or equal to zero.  Every <literal>MODULUS</literal> must be a factor of
> +      the next larger modulus.
> [ ... and it goes on from there ... ]
>
> This paragraph is fairly terrible, because it's a design spec that I
> wrote, not an explanation intended for users.  Here's an attempt to
> improve it:
>
> ===
> When creating a hash partition, a modulus and remainder must be
> specified.  The modulus must be a positive integer, and the remainder
> must a non-negative integer less than the modulus.  Typically, when
> initially setting up a hash-partitioned table, you should choose a
> modulus equal to the number of partitions and assign every table the
> same modulus and a different remainder (see examples, below).
> However, it is not required that every partition have the same
> modulus, only that every modulus which occurs among the children of a
> hash-partitioned table is a factor of the next larger modulus.  This
> allows the number of partitions to be increased incrementally without
> needing to move all the data at once.  For example, suppose you have a
> hash-partitioned table with 8 children, each of which has modulus 8,
> but find it necessary to increase the number of partitions to 16.  You
> can detach one of the modulus-8 partitions, create two new modulus-16
> partitions covering the same portion of the key space (one with a
> remainder equal to the remainder of the detached partition, and the
> other with a remainder equal to that value plus 8), and repopulate
> them with data.  You can then repeat this -- perhaps at a later time
> -- for each modulus-8 partition until none remain.  While this may
> still involve a large amount of data movement at each step, it is
> still better than having to create a whole new table and move all the
> data at once.
> ===
>
Thanks a lot, added in attached version.

> +CREATE TABLE postal_code (
> +    code         int not null,
> +    city_id      bigint not null,
> +    address      text
> +) PARTITION BY HASH (code);
>
> It would be fairly silly to hash-partition the postal_code table,
> because there aren't enough postal codes to justify it.  Maybe make
> this a lineitem or order table, and partition on the order number.
> Also, extend the example to show creating 4 partitions with modulus 4.
>
Understood, added order table example.

> +                if (spec->strategy != PARTITION_STRATEGY_HASH)
> +                    elog(ERROR, "invalid strategy in partition bound spec");
>
> I think this should be an ereport() if it can happen or an Assert() if
> it's supposed to be prevented by the grammar.
>
Used Assert() in the attach version patch, also changed same for RANGE
and LIST in 0001- cleanup patch.

> +            if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0],
> +                               true, sizeof(int)) &&
>
> It doesn't seem necessary to use datumIsEqual() here.  You know the
> datums are pass-by-value, so why not just use == ?  I'd include a
> comment but I don't think using datumIsEqual() adds anything here
> except unnecessary complexity.  More broadly, I wonder why we're
> cramming this into the datums arrays instead of just adding another
> field to PartitionBoundInfoData that is only used by hash
> partitioning.
>
Fixed in the attached version.

>                     /*
> +                     * Check rule that every modulus must be a factor of the
> +                     * next larger modulus.  For example, if you have a bunch
> +                     * of partitions that all have modulus 5, you can add a new
> +                     * new partition with modulus 10 or a new partition with
> +                     * modulus 15, but you cannot add both a partition with
> +                     * modulus 10 and a partition with modulus 15, because 10
> +                     * is not a factor of 15.  However, you could
> simultaneously
> +                     * use modulus 4, modulus 8, modulus 16, and modulus 32 if
> +                     * you wished, because each modulus is a factor of the next
> +                     * larger one.  You could also use modulus 10, modulus 20,
> +                     * and modulus 60. But you could not use modulus 10,
> +                     * modulus 15, and modulus 60 for the same reason.
> +                     */
>
> I think just the first sentence is fine here; I'd nuke the rest of this.
>
Fixed in the attached version.

> The block that follows could be merged into the surrounding block.
> There's no need to increase the indentation level here, so let's not.
> I also suspect that the code itself is wrong.  There are two ways a
> modulus can be invalid: it can either fail to be a multiple of the
> next lower-modulus, or it can fail to be a factor of the next-higher
> modulus.  I think your code only checks the latter.  So for example,
> if the current modulus list is (4, 36), your code would correctly
> disallow 3 because it's not a factor of 4 and would correctly disallow
> 23 because it's not a factor of 36, but it looks to me like it would
> allow 9 because that's a factor of 36. However, then the list would be
> (4, 9, 36), and 4 is not a factor of 9.
>
This case is already handled in previous patch and similar regression
test does exists in create_table.sql, see this in v2 patch.

  +-- check partition bound syntax for the hash partition
  +CREATE TABLE hash_parted (
  +   a int
  +) PARTITION BY HASH (a);
  +CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH
(modulus 10, remainder 1);
  +CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH
(modulus 50, remainder 0);
  +-- modulus 25 is factor of modulus of 50 but 10 is not factor of 25.
  +CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH
(modulus 25, remainder 2);

> +                    greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]);
>
> Here, insert: /* Normally, the lowest remainder that could conflict
> with the new partition is equal to the remainder specified for the new
> partition, but when the new partition has a modulus higher than any
> used so far, we need to adjust. */
>
> +                    place = spec->remainder;
> +                    if (place >= greatest_modulus)
> +                        place = place % greatest_modulus;
>
Fixed in the attached version.

> Here, insert: /* Check every potentially-conflicting remainder. */
>
> +                    do
> +                    {
> +                        if (boundinfo->indexes[place] != -1)
> +                        {
> +                            overlap = true;
> +                            with = boundinfo->indexes[place];
> +                            break;
> +                        }
> +                        place = place + spec->modulus;
>
> Maybe use += ?
>
Fixed.

> +                    } while (place < greatest_modulus);
>
> + * Used when sorting hash bounds across all hash modulus
> + * for hash partitioning
>
> This is not a very descriptive comment.  Maybe /* We sort hash bounds
> by modulus, then by remainder. */
>
Fixed.

> +cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull)
>
> I agree with Ashutosh's critique of this name.
>
Fixed.

> +    /*
> +     * Cache hash function information, similar to how record_eq() caches
> +     * equality operator information.  (Perhaps no SQL syntax could cause
> +     * PG_NARGS()/nkeys to change between calls through the same FmgrInfo.
> +     * Checking nkeys here is just defensiveness.)
> +     */
>
> Unless I'm missing something, this comment does not actually describe
> what the code does.  Each call to the function repeats the same
> TypeCacheEntry lookups.  I'm not actually sure whether caching here
> can actually help - is there any situation in which the same FmgrInfo
> will get used repeatedly here?  But if it is possible then this code
> fails to achieve its intended objective.
>
This code is no longer exists in new satisfies_hash_partition() code.

> Another problem with this code is that, unless I'm missing something,
> it completely ignores the opclass the user specified and just looks up
> the default hash opclass.  I think you should create a non-default
> hash opclass for some data type -- maybe create one for int4 that just
> returns the input value unchanged -- and test that the specifying
> default hash opclass routes tuples according to hash_uint32(val) %
> modulus while specifying your customer opclass routes tuples according
> to val % modulus.
>
> Unless I'm severely misunderstanding the situation this code is
> seriously undertested.
>
You are correct, I've missed to opclass handling.  Fixed in the
attached version, and added same case regression test.

> +             * Identify a btree opclass to use. Currently, we use only btree
> +             * operators, which seems enough for list and range partitioning.
>
> This comment is false, right?
>
Not really, this has been re-added due to indentation change.

> +                        appendStringInfoString(buf, "FOR VALUES");
> +                        appendStringInfo(buf, " WITH (modulus %d,
> remainder %d)",
> +                                         spec->modulus, spec->remainder);
>
> You could combine these.
>
I am not sure about this, I've used same code style exist in
get_rule_expr() for range and list.  Do you want me to change this for
other partitioning as well?

> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 0, remainder 1);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  modulus must be greater than zero
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 8, remainder 8);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  modulus must be greater than remainder
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 3, remainder 2);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  every modulus must be factor of next largest modulus
>
> It seems like you could merge the hint back into the error:
>
> ERROR: hash partition modulus must be greater than 0
> ERROR: hash partition remainder must be less than modulus
> ERROR: every hash partition modulus must be a factor of the next larger modulus
>
Added same in the attached version. Thanks again.

> +DETAIL:  Partition key of the failing row contains (HASHa, b) = (c, 5).
>
> That's obviously garbled somehow.
>
Oops.  Fixed in the attached version.

> +hash_partbound_elem:
> +        NonReservedWord Iconst
> +            {
> +                $$ = makeDefElem($1, (Node *)makeInteger($2), @1);
> +            }
> +        ;
> +
> +hash_partbound:
> +        hash_partbound_elem ',' hash_partbound_elem
> +            {
> +                $$ = list_make2($1, $3);
> +            }
> +        ;
>
> I don't think that it's the grammar's job to enforce that exactly two
> options are present.  It should allow any number of options, and some
> later code, probably during parse analysis, should check that the ones
> you need are present and that there are no invalid ones.  See the code
> for EXPLAIN, VACUUM, etc.
>
Tried to fixed in the attached version.

> Regarding the test cases, I think that you've got a lot of tests for
> failure scenarios (which is good) but not enough for success
> scenarios.  For example, you test that inserting a row into the wrong
> hash partition fails, but not (unless I missed it) that tuple routing
> succeeds.  I think it would be good to have a test where you insert
> 1000 or so rows into a hash partitioned table just to see it all work.
>
I am quite unsure about this test, now sure how can we verify correct
tuple routing?

> Also, you haven't done anything about the fact that constraint
> exclusion doesn't work for hash partitioned tables, a point I raised
> in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
> and which I still think is quite important.  I think that to have a
> committable patch for this feature that would have to be addressed.
>
Do you mean, we should come up with special handling(pre-pruning) for
hash partitioning or modify constraints exclusion so that it will
handle hash partition expression and cases that you have discussed in
thread[1] as well?  I was under the impression that we might going to
have this as a separate feature proposal.


1]. https://www.postgresql.org/message-id/CA%2BTgmoaE9NZ_RiqZQLp2aJXPO4E78QxkQYL-FR2zCDop96Ahdg%40mail.gmail.com

Regards,
Amul Sul

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

Previous
From: tushar
Date:
Subject: [HACKERS] Getting error at the time of dropping subscription and few moreissues
Next
From: amul sul
Date:
Subject: Re: [HACKERS] [POC] hash partitioning