Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From amul sul
Subject Re: [HACKERS] [POC] hash partitioning
Date
Msg-id CAAJ_b97NK6MB9fr0esOohJ_XFFYVcTBgRR99byPdr=eT6ygERA@mail.gmail.com
In response to Re: [HACKERS] [POC] hash partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Responses Re: [HACKERS] [POC] hash partitioning  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
List pgsql-hackers
On Fri, May 12, 2017 at 10:39 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
>> Hi,
>>
>> Please find the following updated patches attached:
>>
>> 0001-Cleanup.patch : Does some cleanup and code refactoring required
>> for hash partition patch. Otherwise, there will be unnecessary diff in
>> 0002 patch
>
> Thanks for splitting the patch.
>
> +                if (isnull[0])
> +                    cur_index = partdesc->boundinfo->null_index;
> This code assumes that null_index will be set to -1 when has_null is false. Per
> RelationBuildPartitionDesc() this is true. But probably we should write this
> code as
> if (isnull[0])
> {
>     if (partdesc->boundinfo->has_null)
>         cur_index = partdesc->boundinfo->null_index;
> }
> That way we are certain that when has_null is false, cur_index = -1 similar to
> the original code.
>
Okay, will add this.  I still don't understand the point of having the
has_null variable: if no null-accepting partition exists, null_index is
always set to -1 in RelationBuildPartitionDesc.  Anyway, let's not change
the original code.

> The additional argument to ComputePartitionAttrs() isn't used anywhere in this
> patch, so it may be better to move it to 0002. If we do this the only change
> that will remain in patch is the refactoring of RelationBuildPartitionDesc(),
> so we may consider merging it into 0002, unless we find that some more
> refactoring is needed. But for now, having it as a separate patch helps.
>
Okay.

> Here's some more comments on 0002
>
> + * In the case of hash partitioning, datums is a 2-D array that stores the
> + * modulus and remainder values at datums[x][0] and datums[x][1] respectively,
> + * for each partition in ascending order.
>
> This comment about datums should appear in a paragraph of itself and may be
> rephrased as in the attached patch. May be we could also add a line about
> ndatums for hash partitioned tables as in the attached patch.
>
Thanks, looks good to me; will include this.

[...]
>
> +        if (key->strategy == PARTITION_STRATEGY_HASH)
> +        {
> +            ndatums = nparts;
> +            hbounds = (PartitionHashBound **) palloc(nparts *
> +                                          sizeof(PartitionHashBound *));
> +            i = 0;
> +            foreach (cell, boundspecs)
> +            {
> +                PartitionBoundSpec *spec = lfirst(cell);
> +
> [ clipped ]
> +                hbounds[i]->index = i;
> +                i++;
> +            }
> For list and range partitioned tables we order the bounds so that two
> partitioned tables have them in the same order irrespective of the order in
> which they are specified by the user and hence stored in the catalogs. The
> partitions then get indexes according to the order in which their bounds
> appear in the ordered arrays of bounds. Thus any two partitioned tables with
> the same partition specification always have the same PartitionBoundInfoData.
> This helps partition-wise join match the partition bounds of two given
> tables.  The above code assigns indexes to the partitions as they appear in
> the catalogs. This means that two partitioned tables with the same partition
> specification but a different order of partition bound specifications will
> have different PartitionBoundInfoData representations.
>
> If we do that, partition_bounds_equal() would probably reduce to just
> matching the indexes and the last element of the datums array, i.e. the
> greatest modulus datum. If the ordered datums arrays of two partitioned
> tables do not match exactly, the mismatch can be due to missing datums or
> differing datums. If it's a missing datum it will change the greatest modulus
> or have a corresponding entry of -1 in the indexes array. If an entry differs
> it will cause mismatching entries in the indexes arrays.
>
Makes sense, will fix this.
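To illustrate the canonical-ordering idea above, here is a minimal sketch (using a simplified stand-in for the patch's PartitionHashBound struct, not the actual patch code): sorting the bounds by (modulus, remainder) yields the same array regardless of the order in which the bounds were stored in the catalogs.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for the patch's PartitionHashBound. */
typedef struct HashBound
{
    int modulus;
    int remainder;
    int index;
} HashBound;

/*
 * qsort comparator: order by modulus first, then remainder, so that two
 * tables with the same partition specification end up with identical
 * bound arrays irrespective of catalog order.
 */
static int
hash_bound_qsort_cmp(const void *a, const void *b)
{
    const HashBound *h1 = (const HashBound *) a;
    const HashBound *h2 = (const HashBound *) b;

    if (h1->modulus != h2->modulus)
        return (h1->modulus < h2->modulus) ? -1 : 1;
    if (h1->remainder != h2->remainder)
        return (h1->remainder < h2->remainder) ? -1 : 1;
    return 0;
}
```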

[...]
>
> +                    if (offset < 0)
> +                    {
> +                        nmod = DatumGetInt32(datums[0][0]);
> +                        valid_bound = (nmod % spec->modulus) == 0;
> +                    }
> +                    else
> +                    {
> +                        pmod = DatumGetInt32(datums[offset][0]);
> +                        valid_bound = (spec->modulus % pmod) == 0;
> +
> +                        if (valid_bound && (offset + 1) < ndatums)
> +                        {
> +                            nmod = DatumGetInt32(datums[offset + 1][0]);
> +                            valid_bound = (nmod % spec->modulus) == 0;
> +                        }
> +                    }
> Maybe name the variables prev_mod(ulus) and next_mod(ulus) for better
> readability.
>
Okay, will rename them to prev_modulus and next_modulus respectively.
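The rule the quoted validity check enforces can be stated compactly: in the sorted bound array, every modulus must divide the next larger one. A hedged sketch of that rule as a standalone helper (hypothetical, not the patch's code; 0 denotes a missing neighbour):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A new modulus is valid only if the previous (smaller or equal)
 * modulus divides it, and it divides the next (larger or equal)
 * modulus.  Pass 0 for a neighbour that does not exist.
 */
static bool
modulus_is_valid(int prev_modulus, int new_modulus, int next_modulus)
{
    if (prev_modulus > 0 && new_modulus % prev_modulus != 0)
        return false;           /* previous modulus must be a factor */
    if (next_modulus > 0 && next_modulus % new_modulus != 0)
        return false;           /* must be a factor of the next modulus */
    return true;
}
```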

> + *   for p_p1: satisfies_hash_partition(2, 1, hash_fn(a), hash_fn(b))
> + *   for p_p2: satisfies_hash_partition(4, 2, hash_fn(a), hash_fn(b))
> + *   for p_p3: satisfies_hash_partition(8, 0, hash_fn(a), hash_fn(b))
> + *   for p_p4: satisfies_hash_partition(8, 4, hash_fn(a), hash_fn(b))
> The description here may be read as if we are calling the same hash function
> for both a and b, but that's not true. So you may want to clarify that in
> hash_fn(a), hash_fn means the hash function specified for key a.
>
Okay.

>
> +        if (key->partattrs[i] != 0)
> +        {
> +            keyCol = (Node *) makeVar(1,
> +                                      key->partattrs[i],
> +                                      key->parttypid[i],
> +                                      key->parttypmod[i],
> +                                      key->parttypcoll[i],
> +                                      0);
> +
> +            /* Form hash_fn(value) expression */
> +            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
> +                                    get_fn_expr_rettype(&key->partsupfunc[i]),
> +                                    list_make1(keyCol),
> +                                    InvalidOid,
> +                                    InvalidOid,
> +                                    COERCE_EXPLICIT_CALL);
> +        }
> +        else
> +        {
> +            keyCol = (Node *) copyObject(lfirst(partexprs_item));
> +            partexprs_item = lnext(partexprs_item);
> +        }
> I think we should add FuncExpr for column Vars as well as expressions.
>
Okay, will fix this.

> The logic to compare two bounds is duplicated in qsort_partition_hbound_cmp()
> and partition_bound_cmp(). Should we separate it into a single function
> accepting moduli and remainders? That way, if we change it in the future, we
> have to change only one place.
>
Okay.
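A sketch of the suggested refactoring: one shared comparison routine on (modulus, remainder) pairs that both call sites could invoke (the function name follows the thread's naming, but the body is a hypothetical illustration, not the patch's code):

```c
#include <assert.h>

/*
 * Compare two hash partition bounds by modulus, then remainder.
 * Returns <0, 0, or >0 in the usual comparator convention, so both
 * qsort_partition_hbound_cmp() and partition_bound_cmp() can share it.
 */
static int
partition_hbound_cmp(int modulus1, int remainder1,
                     int modulus2, int remainder2)
{
    if (modulus1 != modulus2)
        return (modulus1 < modulus2) ? -1 : 1;
    if (remainder1 != remainder2)
        return (remainder1 < remainder2) ? -1 : 1;
    return 0;
}
```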

> I think we need more comments for compute_hash_value(), mix_hash_value() and
> satisfies_hash_partition() as to what each of them accepts and what it
> computes.
>
> +        /* key's hash values start from third argument of function. */
> +        if (!PG_ARGISNULL(i + 2))
> +        {
> +            values[i] = PG_GETARG_DATUM(i + 2);
> +            isnull[i] = false;
> +        }
> +        else
> +            isnull[i] = true;
> You could write this as
> isnull[i] = PG_ARGISNULL(i + 2);
> if (!isnull[i])
>     values[i] = PG_GETARG_DATUM(i + 2);
>
Okay.
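For the comment request on mix_hash_value() above, one illustrative way to fold per-column hash values into a single hash is a hash_combine-style mixer. This is a hypothetical sketch of the idea only, not the patch's actual mix_hash_value() implementation:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fold one column's hash value into a running accumulator.  The magic
 * constant and shifts follow the well-known hash_combine recipe; the
 * mix is order-sensitive, so (a, b) and (b, a) hash differently.
 */
static uint32_t
mix_hash_value(uint32_t acc, uint32_t h)
{
    acc ^= h + 0x9e3779b9 + (acc << 6) + (acc >> 2);
    return acc;
}
```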

>
> +         * Identify a btree or hash opclass to use. Currently, we use only
> +         * btree operators, which seems enough for list and range partitioning,
> +         * and hash operators for hash partitioning.
>
> The wording, if not read carefully, might be taken to mean "we use only btree
> operators".  I suggest we rephrase it for clarity as "Identify the opclass to
> use. For list and range partitioning we use only btree operators, which seems
> enough for those. For hash partitioning, we use hash operators."
>
Okay.

> +                    foreach (lc, $5)
> +                    {
> +                        DefElem    *opt = (DefElem *) lfirst(lc);
> A search for WITH in gram.y shows that we do not handle WITH options in gram.y.
> Usually they are handled at the transformation stage. Why is this an exception?
> If you do that, we can have all the error handling in
> transformPartitionBound().
>
If so, ForValues would need to return a List for hash and a PartitionBoundSpec
for the other two; wouldn't that break code consistency? Also, such
validation is not new in gram.y; see xmltable_column_el.

> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
> Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
> INT4)?
>
Will use INT4ARRAY in the next patch, but I am a little sceptical of it: we
need an unsigned int32, but unfortunately there is no variadic uint32
support.  How about INT8ARRAY?

Regards,
Amul


