Thread: [HACKERS] [POC] hash partitioning

[HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

28 February 2017, 17:33:13

Hi all,

We now have declarative partitioning, but hash partitioning is not
implemented yet. Attached is a POC patch that adds a hash partitioning
feature. I know we will need more discussion about the syntax and other
specifications before going ahead with the project, but I think this
runnable code might help us discuss what to implement and how.

* Description

In this patch, the hash partitioning implementation is largely based
on the list partitioning mechanism. However, partition bounds cannot be
specified explicitly; the bound is used internally as the hash partition
index, which is calculated when a partition is created or attached.

The tentative syntax to create a partitioned table is as below:

 CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;

The number of partitions is specified by PARTITIONS. It is currently
constant and cannot be changed, but I think this needs to become changeable
in some manner. A hash function is specified by USING. Perhaps specifying
the hash function could be made optional, in which case a default hash
function corresponding to the key type would be used.

A partition can be created as below:

 CREATE TABLE h1 PARTITION OF h;
 CREATE TABLE h2 PARTITION OF h;
 CREATE TABLE h3 PARTITION OF h;

The FOR VALUES clause cannot be used, and the partition bound is
calculated automatically as a partition index, a single integer value.

Trying to create more partitions than the number specified
by PARTITIONS raises an error.

postgres=# create table h4 partition of h;
ERROR:  cannot create hash partition more than 3 for h
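
For illustration only, here is a minimal standalone sketch of how the
next partition index could be picked: take the smallest index in
[0, PARTITIONS) that no existing partition uses. It mirrors the idea of
get_next_hash_partition_index() in the attached patch but is not code
from it; the function name and the plain int array are assumptions made
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return the smallest index in [0, nparts) that does not appear in
 * used_idx[0..nused-1]. Returns nparts when every index is taken,
 * which the caller would report as "too many partitions".
 * Returns -1 if memory cannot be allocated.
 */
int
next_free_partition_index(const int *used_idx, int nused, int nparts)
{
    bool   *used;
    int     i;

    assert(nparts > 0);

    used = calloc((size_t) nparts, sizeof(bool));
    if (used == NULL)
        return -1;

    /* mark the indexes that existing partitions already occupy */
    for (i = 0; i < nused; i++)
        if (used_idx[i] >= 0 && used_idx[i] < nparts)
            used[used_idx[i]] = true;

    /* find the minimal unused index */
    for (i = 0; i < nparts; i++)
        if (!used[i])
            break;

    free(used);
    return i;
}
```

With PARTITIONS 3 and existing indexes {0, 2}, this returns 1; once all
three indexes are taken it returns 3, which corresponds to the error
case above.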

An inserted record is stored in a partition whose index equals
abs(hashfunc(key)) % <number_of_partitions>. In the above
example, this is abs(hashint4(i))%3.

postgres=# insert into h (select generate_series(0,20));
INSERT 0 21

postgres=# select *,tableoid::regclass from h;
 i  | tableoid 
----+----------
  0 | h1
  1 | h1
  2 | h1
  4 | h1
  8 | h1
 10 | h1
 11 | h1
 14 | h1
 15 | h1
 17 | h1
 20 | h1
  5 | h2
 12 | h2
 13 | h2
 16 | h2
 19 | h2
  3 | h3
  6 | h3
  7 | h3
  9 | h3
 18 | h3
(21 rows)
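
As a hedged illustration of the routing rule, the sketch below computes
abs(hash) % <number_of_partitions> for a hash value that has already
been calculated. The function name is made up for this example, and the
hash value is taken as an input because reproducing hashint4 is out of
scope here, so this does not reproduce the assignments listed above.
Widening to 64 bits before taking the absolute value is a choice of this
sketch to avoid overflow on the minimum 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit hash value to a partition index in [0, nparts)
 * using the rule abs(hash) % nparts.
 */
int
partition_index(int32_t hash, int nparts)
{
    int64_t h = hash;   /* widen so that -INT32_MIN is representable */

    assert(nparts > 0);

    if (h < 0)
        h = -h;

    return (int) (h % nparts);
}
```

For example, a hash value of -7 with three partitions gives index 1,
since abs(-7) % 3 = 1.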

* Todo / discussions

In this patch, we cannot change the number of partitions specified
by PARTITIONS. If we could change this, the partitioning rule
(<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
would also change, and we would then need to reallocate records between
partitions.
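
To make the cost of changing the partition count concrete, here is a
small sketch (not part of the patch; all names are hypothetical) that
counts how many hash values land in a different partition when the
modulus changes. It assumes the same abs(hash) % N rule described above.

```c
#include <assert.h>
#include <stdint.h>

/* abs(hash) % nparts, widened to avoid overflow on INT32_MIN */
static int
partition_index(int32_t hash, int nparts)
{
    int64_t h = hash;

    assert(nparts > 0);
    if (h < 0)
        h = -h;
    return (int) (h % nparts);
}

/*
 * Count the hash values whose partition index differs between
 * old_nparts and new_nparts, i.e. the rows that would have to move.
 */
int
count_moved(const int32_t *hashes, int n, int old_nparts, int new_nparts)
{
    int moved = 0;
    int i;

    for (i = 0; i < n; i++)
        if (partition_index(hashes[i], old_nparts) !=
            partition_index(hashes[i], new_nparts))
            moved++;

    return moved;
}
```

For the hash values 0..11, going from 3 to 4 partitions moves 9 of the
12 values; only 0, 1 and 2 keep their index. With a plain modulo rule,
most rows are expected to move whenever the count changes.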

In this patch, the user can specify a hash function with USING. However,
we might need default hash functions which are useful and
proper for hash partitioning.

Currently, even when we issue a SELECT query with a condition,
postgres looks into all partitions regardless of each partition's
constraint, because the constraint is a complicated expression such as
"abs(hashint4(i))%3 = 0".

postgres=# explain select * from h where i = 10;
                        QUERY PLAN                        
----------------------------------------------------------
 Append  (cost=0.00..125.62 rows=40 width=4)
   ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
         Filter: (i = 10)
   ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
         Filter: (i = 10)
   ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
         Filter: (i = 10)
   ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
         Filter: (i = 10)
(9 rows)

However, if we rewrite the condition into the same expression
as the partition constraint, postgres can exclude unrelated
tables from the search targets. So, we might avoid the problem
by converting the qual properly before calling predicate_refuted_by().

postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
                        QUERY PLAN                        
----------------------------------------------------------
 Append  (cost=0.00..61.00 rows=14 width=4)
   ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
         Filter: ((abs(hashint4(i)) % 3) = 2)
   ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
         Filter: ((abs(hashint4(i)) % 3) = 2)
(5 rows)
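
The effect of such a conversion for an equality condition can be
sketched outside the planner as follows. This is only an illustration
under the abs(hash) % N rule, with a hypothetical function name; it does
not use predicate_refuted_by() or any planner data structure. Given the
hash of the constant in "key = constant", exactly one partition can hold
matching rows, so all others can be skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * For a condition "key = constant", fill scan[0..nparts-1] so that only
 * the partition whose index equals abs(const_hash) % nparts is scanned.
 * const_hash is the hash of the constant, computed by the caller.
 */
void
prune_for_equality(int32_t const_hash, int nparts, bool *scan)
{
    int64_t h = const_hash;     /* widen to avoid overflow on INT32_MIN */
    int     target;
    int     i;

    assert(nparts > 0);

    if (h < 0)
        h = -h;
    target = (int) (h % nparts);

    for (i = 0; i < nparts; i++)
        scan[i] = (i == target);
}
```

With a constant whose hash is 5 and three partitions, only the partition
with index 2 remains, which matches the shape of the plan above where a
single child table is scanned.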

Best regards,
Yugo Nagata

-- 
Yugo Nagata <nagata@sraoss.co.jp>


Attachment

hash_partition.patch

Re: [HACKERS] [POC] hash partitioning

From

Aleksander Alekseev

Date:

28 February 2017, 18:05:36

Hi, Yugo.

Looks like a great feature! I'm going to take a closer look at your code
and write up some feedback shortly. For now I can only say that you forgot
to include documentation in the patch.

I've added a corresponding entry to the current commitfest [1]. Hope you
don't mind. If it's not too much trouble, could you please register on the
commitfest site and add yourself to this entry as an author? I'm pretty
sure someone is using this information for writing release notes or
something like that.

[1] https://commitfest.postgresql.org/13/1059/

On Tue, Feb 28, 2017 at 11:33:13PM +0900, Yugo Nagata wrote:

> diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> index 41c0056..3820920 100644
> --- a/src/backend/catalog/heap.c
> +++ b/src/backend/catalog/heap.c
> @@ -3074,7 +3074,7 @@ StorePartitionKey(Relation rel,
>                    AttrNumber *partattrs,
>                    List *partexprs,
>                    Oid *partopclass,
> -                  Oid *partcollation)
> +                  Oid *partcollation, int16 partnparts, Oid hashfunc)
>  {
>      int            i;
>      int2vector *partattrs_vec;
> @@ -3121,6 +3121,8 @@ StorePartitionKey(Relation rel,
>      values[Anum_pg_partitioned_table_partrelid - 1] = ObjectIdGetDatum(RelationGetRelid(rel));
>      values[Anum_pg_partitioned_table_partstrat - 1] = CharGetDatum(strategy);
>      values[Anum_pg_partitioned_table_partnatts - 1] = Int16GetDatum(partnatts);
> +    values[Anum_pg_partitioned_table_partnparts - 1] = Int16GetDatum(partnparts);
> +    values[Anum_pg_partitioned_table_parthashfunc - 1] = ObjectIdGetDatum(hashfunc);
>      values[Anum_pg_partitioned_table_partattrs - 1] = PointerGetDatum(partattrs_vec);
>      values[Anum_pg_partitioned_table_partclass - 1] = PointerGetDatum(partopclass_vec);
>      values[Anum_pg_partitioned_table_partcollation - 1] = PointerGetDatum(partcollation_vec);
> diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> index 4bcef58..24e69c6 100644
> --- a/src/backend/catalog/partition.c
> +++ b/src/backend/catalog/partition.c
> @@ -36,6 +36,8 @@
>  #include "optimizer/clauses.h"
>  #include "optimizer/planmain.h"
>  #include "optimizer/var.h"
> +#include "parser/parse_func.h"
> +#include "parser/parse_oper.h"
>  #include "rewrite/rewriteManip.h"
>  #include "storage/lmgr.h"
>  #include "utils/array.h"
> @@ -120,6 +122,7 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
>
>  static List *get_qual_for_list(PartitionKey key, PartitionBoundSpec *spec);
>  static List *get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec);
> +static List *get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec);
>  static Oid get_partition_operator(PartitionKey key, int col,
>                         StrategyNumber strategy, bool *need_relabel);
>  static List *generate_partition_qual(Relation rel);
> @@ -236,7 +239,8 @@ RelationBuildPartitionDesc(Relation rel)
>              oids[i++] = lfirst_oid(cell);
>
>          /* Convert from node to the internal representation */
> -        if (key->strategy == PARTITION_STRATEGY_LIST)
> +        if (key->strategy == PARTITION_STRATEGY_LIST ||
> +            key->strategy == PARTITION_STRATEGY_HASH)
>          {
>              List       *non_null_values = NIL;
>
> @@ -251,7 +255,7 @@ RelationBuildPartitionDesc(Relation rel)
>                  ListCell   *c;
>                  PartitionBoundSpec *spec = lfirst(cell);
>
> -                if (spec->strategy != PARTITION_STRATEGY_LIST)
> +                if (spec->strategy != key->strategy)
>                      elog(ERROR, "invalid strategy in partition bound spec");
>
>                  foreach(c, spec->listdatums)
> @@ -464,6 +468,7 @@ RelationBuildPartitionDesc(Relation rel)
>          switch (key->strategy)
>          {
>              case PARTITION_STRATEGY_LIST:
> +            case PARTITION_STRATEGY_HASH:
>                  {
>                      boundinfo->has_null = found_null;
>                      boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
> @@ -829,6 +834,18 @@ check_new_partition_bound(char *relname, Relation parent, Node *bound)
>                  break;
>              }
>
> +        case PARTITION_STRATEGY_HASH:
> +            {
> +                Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> +
> +                if (partdesc->nparts + 1 > key->partnparts)
> +                    ereport(ERROR,
> +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> +                    errmsg("cannot create hash partition more than %d for %s",
> +                            key->partnparts, RelationGetRelationName(parent))));
> +                break;
> +            }
> +
>          default:
>              elog(ERROR, "unexpected partition strategy: %d",
>                   (int) key->strategy);
> @@ -916,6 +933,11 @@ get_qual_from_partbound(Relation rel, Relation parent, Node *bound)
>              my_qual = get_qual_for_range(key, spec);
>              break;
>
> +        case PARTITION_STRATEGY_HASH:
> +            Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> +            my_qual = get_qual_for_hash(key, spec);
> +            break;
> +
>          default:
>              elog(ERROR, "unexpected partition strategy: %d",
>                   (int) key->strategy);
> @@ -1146,6 +1168,84 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
>      return pd;
>  }
>
> +/*
> + * convert_expr_for_hash
> + *
> + * Converts an expr for a hash partition's constraint.
> + * expr is converted into "abs(hashfunc(expr)) % npart".
> + *
> + * npart: number of partitions
> + * hashfunc: OID of hash function
> + */
> +Expr *
> +convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc)
> +{
> +    FuncExpr   *func,
> +               *abs;
> +    Expr        *modexpr;
> +    Oid            modoid;
> +    Oid            int4oid[1] = {INT4OID};
> +
> +    ParseState *pstate = make_parsestate(NULL);
> +    Value       *val_npart = makeInteger(npart);
> +    Node       *const_npart = (Node *) make_const(pstate, val_npart, -1);
> +
> +    /* hash function */
> +    func = makeFuncExpr(hashfunc,
> +                        INT4OID,
> +                        list_make1(expr),
> +                        0,
> +                        0,
> +                        COERCE_EXPLICIT_CALL);
> +
> +    /* Abs */
> +    abs = makeFuncExpr(LookupFuncName(list_make1(makeString("abs")), 1, int4oid, false),
> +                       INT4OID,
> +                       list_make1(func),
> +                       0,
> +                       0,
> +                       COERCE_EXPLICIT_CALL);
> +
> +    /* modulo by npart */
> +    modoid = LookupOperName(pstate, list_make1(makeString("%")), INT4OID, INT4OID, false, -1);
> +    modexpr = make_opclause(modoid, INT4OID, false, (Expr*)abs, (Expr*)const_npart, 0, 0);
> +
> +    return modexpr;
> +}
> +
> +
> +/*
> + * get_next_hash_partition_index
> + *
> + * Returns the minimal index which is not used for hash partition.
> + */
> +int
> +get_next_hash_partition_index(Relation parent)
> +{
> +    PartitionKey key = RelationGetPartitionKey(parent);
> +    PartitionDesc partdesc = RelationGetPartitionDesc(parent);
> +
> +    int      i;
> +    bool *used = palloc0(sizeof(bool) * key->partnparts);
> +
> +    /* mark existing partition indexes as used */
> +    for (i = 0; i < partdesc->boundinfo->ndatums; i++)
> +    {
> +        Datum *datum = partdesc->boundinfo->datums[i];
> +        int idx = DatumGetInt32(datum[0]);
> +
> +        if (!used[idx])
> +            used[idx] = true;
> +    }
> +
> +    /* find the minimal unused index */
> +    for (i = 0; i < key->partnparts; i++)
> +        if (!used[i])
> +            break;
> +
> +    return i;
> +}
> +
>  /* Module-local functions */
>
>  /*
> @@ -1467,6 +1567,43 @@ get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec)
>  }
>
>  /*
> + * get_qual_for_hash
> + *
> + * Returns a list of expressions to use as a hash partition's constraint.
> + */
> +static List *
> +get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
> +{
> +    List       *result;
> +    Expr       *keyCol;
> +    Expr       *expr;
> +    Expr        *opexpr;
> +    Oid            operoid;
> +    ParseState *pstate = make_parsestate(NULL);
> +
> +    /* Left operand */
> +    if (key->partattrs[0] != 0)
> +        keyCol = (Expr *) makeVar(1,
> +                                  key->partattrs[0],
> +                                  key->parttypid[0],
> +                                  key->parttypmod[0],
> +                                  key->parttypcoll[0],
> +                                  0);
> +    else
> +        keyCol = (Expr *) copyObject(linitial(key->partexprs));
> +
> +    expr = convert_expr_for_hash(keyCol, key->partnparts, key->parthashfunc);
> +
> +    /* equals the listdatums value */
> +    operoid = LookupOperName(pstate, list_make1(makeString("=")), INT4OID, INT4OID, false, -1);
> +    opexpr = make_opclause(operoid, BOOLOID, false, expr, linitial(spec->listdatums), 0, 0);
> +
> +    result = list_make1(opexpr);
> +
> +    return result;
> +}
> +
> +/*
>   * get_partition_operator
>   *
>   * Return oid of the operator of given strategy for a given partition key
> @@ -1730,6 +1867,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
>                              (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
>                          errmsg("range partition key of row contains null")));
>          }
> +        else if (key->strategy == PARTITION_STRATEGY_HASH)
> +        {
> +            values[0] = OidFunctionCall1(key->parthashfunc, values[0]);
> +            values[0] = Int32GetDatum(Abs(DatumGetInt32(values[0])) % key->partnparts);
> +        }
>
>          if (partdesc->boundinfo->has_null && isnull[0])
>              /* Tuple maps to the null-accepting list partition */
> @@ -1744,6 +1886,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
>              switch (key->strategy)
>              {
>                  case PARTITION_STRATEGY_LIST:
> +                case PARTITION_STRATEGY_HASH:
>                      if (cur_offset >= 0 && equal)
>                          cur_index = partdesc->boundinfo->indexes[cur_offset];
>                      else
> @@ -1968,6 +2111,7 @@ partition_bound_cmp(PartitionKey key, PartitionBoundInfo boundinfo,
>      switch (key->strategy)
>      {
>          case PARTITION_STRATEGY_LIST:
> +        case PARTITION_STRATEGY_HASH:
>              cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
>                                                       key->partcollation[0],
>                                                       bound_datums[0],
> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> index 3cea220..5a28cc0 100644
> --- a/src/backend/commands/tablecmds.c
> +++ b/src/backend/commands/tablecmds.c
> @@ -41,6 +41,7 @@
>  #include "catalog/pg_inherits_fn.h"
>  #include "catalog/pg_namespace.h"
>  #include "catalog/pg_opclass.h"
> +#include "catalog/pg_proc.h"
>  #include "catalog/pg_tablespace.h"
>  #include "catalog/pg_trigger.h"
>  #include "catalog/pg_type.h"
> @@ -77,6 +78,7 @@
>  #include "parser/parse_oper.h"
>  #include "parser/parse_relation.h"
>  #include "parser/parse_type.h"
> +#include "parser/parse_func.h"
>  #include "parser/parse_utilcmd.h"
>  #include "parser/parser.h"
>  #include "pgstat.h"
> @@ -450,7 +452,7 @@ static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
>                                   Oid oldrelid, void *arg);
>  static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
>  static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
> -static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> +static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
>                        List **partexprs, Oid *partopclass, Oid *partcollation);
>  static void CreateInheritance(Relation child_rel, Relation parent_rel);
>  static void RemoveInheritance(Relation child_rel, Relation parent_rel);
> @@ -799,8 +801,10 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
>          AttrNumber    partattrs[PARTITION_MAX_KEYS];
>          Oid            partopclass[PARTITION_MAX_KEYS];
>          Oid            partcollation[PARTITION_MAX_KEYS];
> +        Oid            partatttypes[PARTITION_MAX_KEYS];
>          List       *partexprs = NIL;
>          List       *cmds = NIL;
> +        Oid hashfuncOid = InvalidOid;
>
>          /*
>           * We need to transform the raw parsetrees corresponding to partition
> @@ -811,15 +815,40 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
>          stmt->partspec = transformPartitionSpec(rel, stmt->partspec,
>                                                  &strategy);
>          ComputePartitionAttrs(rel, stmt->partspec->partParams,
> -                              partattrs, &partexprs, partopclass,
> +                              partattrs, partatttypes, &partexprs, partopclass,
>                                partcollation);
>
>          partnatts = list_length(stmt->partspec->partParams);
> +
> +        if (strategy == PARTITION_STRATEGY_HASH)
> +        {
> +            Oid funcrettype;
> +
> +            if (partnatts != 1)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                        errmsg("number of partition key must be 1 for hash partition")));
> +
> +            hashfuncOid = LookupFuncName(stmt->partspec->hashfunc, 1, partatttypes, false);
> +            funcrettype = get_func_rettype(hashfuncOid);
> +            if (funcrettype != INT4OID)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                        errmsg("hash function for partitioning must return integer")));
> +
> +            if (func_volatile(hashfuncOid) != PROVOLATILE_IMMUTABLE)
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                        errmsg("hash function for partitioning must be marked IMMUTABLE")));
> +
> +        }
> +
>          StorePartitionKey(rel, strategy, partnatts, partattrs, partexprs,
> -                          partopclass, partcollation);
> +                          partopclass, partcollation, stmt->partspec->partnparts, hashfuncOid);
>
> -        /* Force key columns to be NOT NULL when using range partitioning */
> -        if (strategy == PARTITION_STRATEGY_RANGE)
> +        /* Force key columns to be NOT NULL when using range or hash partitioning */
> +        if (strategy == PARTITION_STRATEGY_RANGE ||
> +            strategy == PARTITION_STRATEGY_HASH)
>          {
>              for (i = 0; i < partnatts; i++)
>              {
> @@ -12783,18 +12812,51 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
>      newspec->strategy = partspec->strategy;
>      newspec->location = partspec->location;
>      newspec->partParams = NIL;
> +    newspec->partnparts = partspec->partnparts;
> +    newspec->hashfunc = partspec->hashfunc;
>
>      /* Parse partitioning strategy name */
>      if (!pg_strcasecmp(partspec->strategy, "list"))
>          *strategy = PARTITION_STRATEGY_LIST;
>      else if (!pg_strcasecmp(partspec->strategy, "range"))
>          *strategy = PARTITION_STRATEGY_RANGE;
> +    else if (!pg_strcasecmp(partspec->strategy, "hash"))
> +        *strategy = PARTITION_STRATEGY_HASH;
>      else
>          ereport(ERROR,
>                  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>                   errmsg("unrecognized partitioning strategy \"%s\"",
>                          partspec->strategy)));
>
> +    if (*strategy == PARTITION_STRATEGY_HASH)
> +    {
> +        if (partspec->partnparts < 0)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("number of partitions must be specified for hash partition")));
> +        else if (partspec->partnparts == 0)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("number of partitions must be greater than 0")));
> +
> +        if (list_length(partspec->hashfunc) == 0)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("hash function must be specified for hash partition")));
> +    }
> +    else
> +    {
> +        if (partspec->partnparts >= 0)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("number of partitions can be specified only for hash partition")));
> +
> +        if (list_length(partspec->hashfunc) > 0)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                     errmsg("hash function can be specified only for hash partition")));
> +    }
> +
>      /*
>       * Create a dummy ParseState and insert the target relation as its sole
>       * rangetable entry.  We need a ParseState for transformExpr.
> @@ -12843,7 +12905,7 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
>   * Compute per-partition-column information from a list of PartitionElem's
>   */
>  static void
> -ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> +ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
>                        List **partexprs, Oid *partopclass, Oid *partcollation)
>  {
>      int            attn;
> @@ -13010,6 +13072,7 @@ ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
>                                                 "btree",
>                                                 BTREE_AM_OID);
>
> +        partatttypes[attn] = atttype;
>          attn++;
>      }
>  }
> diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> index 05d8538..f4febc9 100644
> --- a/src/backend/nodes/copyfuncs.c
> +++ b/src/backend/nodes/copyfuncs.c
> @@ -4232,6 +4232,8 @@ _copyPartitionSpec(const PartitionSpec *from)
>
>      COPY_STRING_FIELD(strategy);
>      COPY_NODE_FIELD(partParams);
> +    COPY_SCALAR_FIELD(partnparts);
> +    COPY_NODE_FIELD(hashfunc);
>      COPY_LOCATION_FIELD(location);
>
>      return newnode;
> diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
> index d595cd7..d589eac 100644
> --- a/src/backend/nodes/equalfuncs.c
> +++ b/src/backend/nodes/equalfuncs.c
> @@ -2725,6 +2725,8 @@ _equalPartitionSpec(const PartitionSpec *a, const PartitionSpec *b)
>  {
>      COMPARE_STRING_FIELD(strategy);
>      COMPARE_NODE_FIELD(partParams);
> +    COMPARE_SCALAR_FIELD(partnparts);
> +    COMPARE_NODE_FIELD(hashfunc);
>      COMPARE_LOCATION_FIELD(location);
>
>      return true;
> diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> index b3802b4..d6db80e 100644
> --- a/src/backend/nodes/outfuncs.c
> +++ b/src/backend/nodes/outfuncs.c
> @@ -3318,6 +3318,8 @@ _outPartitionSpec(StringInfo str, const PartitionSpec *node)
>
>      WRITE_STRING_FIELD(strategy);
>      WRITE_NODE_FIELD(partParams);
> +    WRITE_INT_FIELD(partnparts);
> +    WRITE_NODE_FIELD(hashfunc);
>      WRITE_LOCATION_FIELD(location);
>  }
>
> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index e833b2e..b67140d 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -574,6 +574,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>  %type <list>        partbound_datum_list
>  %type <partrange_datum>    PartitionRangeDatum
>  %type <list>        range_datum_list
> +%type <ival>        hash_partitions
> +%type <list>        hash_function
>
>  /*
>   * Non-keyword token types.  These are hard-wired into the "flex" lexer.
> @@ -627,7 +629,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>
>      GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING
>
> -    HANDLER HAVING HEADER_P HOLD HOUR_P
> +    HANDLER HASH HAVING HEADER_P HOLD HOUR_P
>
>      IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
>      INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
> @@ -651,7 +653,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>      OBJECT_P OF OFF OFFSET OIDS OLD ON ONLY OPERATOR OPTION OPTIONS OR
>      ORDER ORDINALITY OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER
>
> -    PARALLEL PARSER PARTIAL PARTITION PASSING PASSWORD PLACING PLANS POLICY
> +    PARALLEL PARSER PARTIAL PARTITION PARTITIONS PASSING PASSWORD PLACING PLANS POLICY
>      POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY
>      PRIOR PRIVILEGES PROCEDURAL PROCEDURE PROGRAM PUBLICATION
>
> @@ -2587,6 +2589,16 @@ ForValues:
>
>                      $$ = (Node *) n;
>                  }
> +
> +            /* a HASH partition */
> +            | /*EMPTY*/
> +                {
> +                    PartitionBoundSpec *n = makeNode(PartitionBoundSpec);
> +
> +                    n->strategy = PARTITION_STRATEGY_HASH;
> +
> +                    $$ = (Node *) n;
> +                }
>          ;
>
>  partbound_datum:
> @@ -3666,7 +3678,7 @@ OptPartitionSpec: PartitionSpec    { $$ = $1; }
>              | /*EMPTY*/            { $$ = NULL; }
>          ;
>
> -PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> +PartitionSpec: PARTITION BY part_strategy '(' part_params ')' hash_partitions hash_function
>                  {
>                      PartitionSpec *n = makeNode(PartitionSpec);
>
> @@ -3674,10 +3686,21 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
>                      n->partParams = $5;
>                      n->location = @1;
>
> +                    n->partnparts = $7;
> +                    n->hashfunc = $8;
> +
>                      $$ = n;
>                  }
>          ;
>
> +hash_partitions: PARTITIONS Iconst { $$ = $2; }
> +                    | /*EMPTY*/   { $$ = -1; }
> +        ;
> +
> +hash_function: USING handler_name { $$ = $2; }
> +                    | /*EMPTY*/ { $$ = NULL; }
> +        ;
> +
>  part_strategy:    IDENT                    { $$ = $1; }
>                  | unreserved_keyword    { $$ = pstrdup($1); }
>          ;
> @@ -14377,6 +14400,7 @@ unreserved_keyword:
>              | GLOBAL
>              | GRANTED
>              | HANDLER
> +            | HASH
>              | HEADER_P
>              | HOLD
>              | HOUR_P
> @@ -14448,6 +14472,7 @@ unreserved_keyword:
>              | PARSER
>              | PARTIAL
>              | PARTITION
> +            | PARTITIONS
>              | PASSING
>              | PASSWORD
>              | PLANS
> diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
> index ff2bab6..8e1be31 100644
> --- a/src/backend/parser/parse_utilcmd.c
> +++ b/src/backend/parser/parse_utilcmd.c
> @@ -40,6 +40,7 @@
>  #include "catalog/pg_opclass.h"
>  #include "catalog/pg_operator.h"
>  #include "catalog/pg_type.h"
> +#include "catalog/partition.h"
>  #include "commands/comment.h"
>  #include "commands/defrem.h"
>  #include "commands/tablecmds.h"
> @@ -3252,6 +3253,24 @@ transformPartitionBound(ParseState *pstate, Relation parent, Node *bound)
>              ++i;
>          }
>      }
> +    else if (strategy == PARTITION_STRATEGY_HASH)
> +    {
> +        Value     *conval;
> +        Node        *value;
> +        int          index;
> +
> +        if (spec->strategy != PARTITION_STRATEGY_HASH)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +                 errmsg("invalid bound specification for a hash partition")));
> +
> +        index = get_next_hash_partition_index(parent);
> +
> +        /* store the partition index as a listdatums value */
> +        conval = makeInteger(index);
> +        value = (Node *) make_const(pstate, conval, -1);
> +        result_spec->listdatums = list_make1(value);
> +    }
>      else
>          elog(ERROR, "unexpected partition strategy: %d", (int) strategy);
>
> diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
> index b27b77d..fab6eea 100644
> --- a/src/backend/utils/adt/ruleutils.c
> +++ b/src/backend/utils/adt/ruleutils.c
> @@ -1423,7 +1423,7 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
>   *
>   * Returns the partition key specification, ie, the following:
>   *
> - * PARTITION BY { RANGE | LIST } (column opt_collation opt_opclass [, ...])
> + * PARTITION BY { RANGE | LIST | HASH } (column opt_collation opt_opclass [, ...])
>   */
>  Datum
>  pg_get_partkeydef(PG_FUNCTION_ARGS)
> @@ -1513,6 +1513,9 @@ pg_get_partkeydef_worker(Oid relid, int prettyFlags)
>          case PARTITION_STRATEGY_RANGE:
>              appendStringInfo(&buf, "RANGE");
>              break;
> +        case PARTITION_STRATEGY_HASH:
> +            appendStringInfo(&buf, "HASH");
> +            break;
>          default:
>              elog(ERROR, "unexpected partition strategy: %d",
>                   (int) form->partstrat);
> @@ -8520,6 +8523,9 @@ get_rule_expr(Node *node, deparse_context *context,
>                          appendStringInfoString(buf, ")");
>                          break;
>
> +                    case PARTITION_STRATEGY_HASH:
> +                        break;
> +
>                      default:
>                          elog(ERROR, "unrecognized partition strategy: %d",
>                               (int) spec->strategy);
> diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
> index 9001e20..829e4d2 100644
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -855,6 +855,9 @@ RelationBuildPartitionKey(Relation relation)
>      key->strategy = form->partstrat;
>      key->partnatts = form->partnatts;
>
> +    key->partnparts = form->partnparts;
> +    key->parthashfunc = form->parthashfunc;
> +
>      /*
>       * We can rely on the first variable-length attribute being mapped to the
>       * relevant field of the catalog's C struct, because all previous
> @@ -999,6 +1002,9 @@ copy_partition_key(PartitionKey fromkey)
>      newkey->strategy = fromkey->strategy;
>      newkey->partnatts = n = fromkey->partnatts;
>
> +    newkey->partnparts = fromkey->partnparts;
> +    newkey->parthashfunc = fromkey->parthashfunc;
> +
>      newkey->partattrs = (AttrNumber *) palloc(n * sizeof(AttrNumber));
>      memcpy(newkey->partattrs, fromkey->partattrs, n * sizeof(AttrNumber));
>
> diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
> index 1187797..367e2f8 100644
> --- a/src/include/catalog/heap.h
> +++ b/src/include/catalog/heap.h
> @@ -141,7 +141,7 @@ extern void StorePartitionKey(Relation rel,
>                    AttrNumber *partattrs,
>                    List *partexprs,
>                    Oid *partopclass,
> -                  Oid *partcollation);
> +                  Oid *partcollation, int16 partnparts, Oid hashfunc);
>  extern void RemovePartitionKeyByRelId(Oid relid);
>  extern void StorePartitionBound(Relation rel, Relation parent, Node *bound);
>
> diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
> index b195d1a..80f4b0e 100644
> --- a/src/include/catalog/partition.h
> +++ b/src/include/catalog/partition.h
> @@ -89,4 +89,6 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
>                          TupleTableSlot *slot,
>                          EState *estate,
>                          Oid *failed_at);
> +extern Expr *convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc);
> +extern int get_next_hash_partition_index(Relation parent);
>  #endif   /* PARTITION_H */
> diff --git a/src/include/catalog/pg_partitioned_table.h b/src/include/catalog/pg_partitioned_table.h
> index bdff36a..69e509c 100644
> --- a/src/include/catalog/pg_partitioned_table.h
> +++ b/src/include/catalog/pg_partitioned_table.h
> @@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
>      char        partstrat;        /* partitioning strategy */
>      int16        partnatts;        /* number of partition key columns */
>
> +    int16        partnparts;
> +    Oid            parthashfunc;
> +
>      /*
>       * variable-length fields start here, but we allow direct access to
>       * partattrs via the C struct.  That's because the first variable-length
> @@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
>      pg_node_tree partexprs;        /* list of expressions in the partition key;
>                                   * one item for each zero entry in partattrs[] */
>  #endif
> +
> +
>  } FormData_pg_partitioned_table;
>
>  /* ----------------
> @@ -62,13 +67,15 @@ typedef FormData_pg_partitioned_table *Form_pg_partitioned_table;
>   *        compiler constants for pg_partitioned_table
>   * ----------------
>   */
> -#define Natts_pg_partitioned_table                7
> +#define Natts_pg_partitioned_table                9
>  #define Anum_pg_partitioned_table_partrelid        1
>  #define Anum_pg_partitioned_table_partstrat        2
>  #define Anum_pg_partitioned_table_partnatts        3
> -#define Anum_pg_partitioned_table_partattrs        4
> -#define Anum_pg_partitioned_table_partclass        5
> -#define Anum_pg_partitioned_table_partcollation 6
> -#define Anum_pg_partitioned_table_partexprs        7
> +#define Anum_pg_partitioned_table_partnparts    4
> +#define Anum_pg_partitioned_table_parthashfunc    5
> +#define Anum_pg_partitioned_table_partattrs        6
> +#define Anum_pg_partitioned_table_partclass        7
> +#define Anum_pg_partitioned_table_partcollation 8
> +#define Anum_pg_partitioned_table_partexprs        9
>
>  #endif   /* PG_PARTITIONED_TABLE_H */
> diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
> index 5afc3eb..1c3474f 100644
> --- a/src/include/nodes/parsenodes.h
> +++ b/src/include/nodes/parsenodes.h
> @@ -730,11 +730,14 @@ typedef struct PartitionSpec
>      NodeTag        type;
>      char       *strategy;        /* partitioning strategy ('list' or 'range') */
>      List       *partParams;        /* List of PartitionElems */
> +    int            partnparts;
> +    List       *hashfunc;
>      int            location;        /* token location, or -1 if unknown */
>  } PartitionSpec;
>
>  #define PARTITION_STRATEGY_LIST        'l'
>  #define PARTITION_STRATEGY_RANGE    'r'
> +#define PARTITION_STRATEGY_HASH        'h'
>
>  /*
>   * PartitionBoundSpec - a partition bound specification
> diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
> index 985d650..0597939 100644
> --- a/src/include/parser/kwlist.h
> +++ b/src/include/parser/kwlist.h
> @@ -180,6 +180,7 @@ PG_KEYWORD("greatest", GREATEST, COL_NAME_KEYWORD)
>  PG_KEYWORD("group", GROUP_P, RESERVED_KEYWORD)
>  PG_KEYWORD("grouping", GROUPING, COL_NAME_KEYWORD)
>  PG_KEYWORD("handler", HANDLER, UNRESERVED_KEYWORD)
> +PG_KEYWORD("hash", HASH, UNRESERVED_KEYWORD)
>  PG_KEYWORD("having", HAVING, RESERVED_KEYWORD)
>  PG_KEYWORD("header", HEADER_P, UNRESERVED_KEYWORD)
>  PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
> @@ -291,6 +292,7 @@ PG_KEYWORD("parallel", PARALLEL, UNRESERVED_KEYWORD)
>  PG_KEYWORD("parser", PARSER, UNRESERVED_KEYWORD)
>  PG_KEYWORD("partial", PARTIAL, UNRESERVED_KEYWORD)
>  PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD)
> +PG_KEYWORD("partitions", PARTITIONS, UNRESERVED_KEYWORD)
>  PG_KEYWORD("passing", PASSING, UNRESERVED_KEYWORD)
>  PG_KEYWORD("password", PASSWORD, UNRESERVED_KEYWORD)
>  PG_KEYWORD("placing", PLACING, RESERVED_KEYWORD)
> diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> index a617a7c..660adfb 100644
> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -62,6 +62,9 @@ typedef struct PartitionKeyData
>      Oid           *partopcintype;    /* OIDs of opclass declared input data types */
>      FmgrInfo   *partsupfunc;    /* lookup info for support funcs */
>
> +    int16        partnparts;        /* number of hash partitions */
> +    Oid            parthashfunc;    /* OID of hash function */
> +
>      /* Partitioning collation per attribute */
>      Oid           *partcollation;
>

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
Best regards,
Aleksander Alekseev

Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

01 March 2017, 05:14:15

Nagata-san,

On 2017/02/28 23:33, Yugo Nagata wrote:
> Hi all,
> 
> Now we have a declarative partitioning, but hash partitioning is not
> implemented yet. Attached is a POC patch to add the hash partitioning
> feature. I know we will need more discussions about the syntax and other
> specifications before going ahead the project, but I think this runnable
> code might help to discuss what and how we implement this.

Great!

> * Description
> 
> In this patch, the hash partitioning implementation is basically based
> on the list partitioning mechanism. However, partition bounds cannot be
> specified explicitly, but this is used internally as hash partition
> index, which is calculated when a partition is created or attached.
> 
> The tentative syntax to create a partitioned table is as bellow;
> 
>  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> 
> The number of partitions is specified by PARTITIONS, which is currently
> constant and cannot be changed, but I think this is needed to be changed in
> some manner. A hash function is specified by USING. Maybe, specifying hash
> function may be ommitted, and in this case, a default hash function
> corresponding to key type will be used.
> 
> A partition table can be create as bellow;
> 
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
> 
> FOR VALUES clause cannot be used, and the partition bound is
> calclulated automatically as partition index of single integer value.
> 
> When trying create partitions more than the number specified
> by PARTITIONS, it gets an error.
> 
> postgres=# create table h4 partition of h;
> ERROR:  cannot create hash partition more than 3 for h

Instead of having to create each partition individually, wouldn't it be
better if the following command

CREATE TABLE h (i int) PARTITION BY HASH (i) PARTITIONS 3;

created the partitions *automatically*?

It makes sense to provide a way to create individual list and range
partitions separately, because users can specify custom bounds for each.
We don't need that for hash partitions, so why make users run separate
commands (without the FOR VALUES clause) anyway?  We may perhaps need to
offer a way to optionally specify a user-defined name for each partition
in the same command, along with tablespace, storage options, etc.  By
default, the names would be generated internally and the user can ALTER
individual partitions after the fact to specify tablespace, etc.
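
To illustrate what the single command would have to do internally, here is a
rough standalone C sketch (not PostgreSQL code) that formats the per-partition
statement the user currently has to type by hand. The function name and the
"<parent>_p<index>" naming scheme are assumptions made up for this example;
nothing in the patch defines them.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Format the CREATE TABLE statement for the idx-th hash partition of
 * "parent".  The "<parent>_p<idx>" naming scheme is hypothetical, and
 * identifiers are not quoted here, which real code would have to do.
 *
 * Returns the number of characters written, or -1 if buf is too small.
 */
int
format_hash_partition_ddl(char *buf, size_t buflen, const char *parent, int idx)
{
    int n = snprintf(buf, buflen, "CREATE TABLE %s_p%d PARTITION OF %s;",
                     parent, idx, parent);

    /* snprintf reports the length it wanted; treat truncation as failure */
    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

Calling this for idx = 0 .. PARTITIONS - 1 would produce the same statements as
the manual sequence in the example, only with generated names.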

Thanks,
Amit

Re: [HACKERS] [POC] hash partitioning

From

Rushabh Lathia

Date:

01 March 2017, 08:00:09

On Tue, Feb 28, 2017 at 8:03 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:

> Hi all,
>
> Now we have a declarative partitioning, but hash partitioning is not
> implemented yet. Attached is a POC patch to add the hash partitioning
> feature. I know we will need more discussions about the syntax and other
> specifications before going ahead the project, but I think this runnable
> code might help to discuss what and how we implement this.
>
> * Description
>
> In this patch, the hash partitioning implementation is basically based
> on the list partitioning mechanism. However, partition bounds cannot be
> specified explicitly, but this is used internally as hash partition
> index, which is calculated when a partition is created or attached.
>
> The tentative syntax to create a partitioned table is as bellow;
>
>  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
>
> The number of partitions is specified by PARTITIONS, which is currently
> constant and cannot be changed, but I think this is needed to be changed in
> some manner. A hash function is specified by USING. Maybe, specifying hash
> function may be ommitted, and in this case, a default hash function
> corresponding to key type will be used.
>
> A partition table can be create as bellow;
>
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
>
> FOR VALUES clause cannot be used, and the partition bound is
> calclulated automatically as partition index of single integer value.
>
> When trying create partitions more than the number specified
> by PARTITIONS, it gets an error.
>
> postgres=# create table h4 partition of h;
> ERROR:  cannot create hash partition more than 3 for h
>
> An inserted record is stored in a partition whose index equals
> abs(hashfunc(key)) % <number_of_partitions>. In the above
> example, this is abs(hashint4(i))%3.
>
> postgres=# insert into h (select generate_series(0,20));
> INSERT 0 21
>
> postgres=# select *,tableoid::regclass from h;
>  i  | tableoid
> ----+----------
>   0 | h1
>   1 | h1
>   2 | h1
>   4 | h1
>   8 | h1
>  10 | h1
>  11 | h1
>  14 | h1
>  15 | h1
>  17 | h1
>  20 | h1
>   5 | h2
>  12 | h2
>  13 | h2
>  16 | h2
>  19 | h2
>   3 | h3
>   6 | h3
>   7 | h3
>   9 | h3
>  18 | h3
> (21 rows)

This is good. I will have a closer look at the patch, but here are a
few quick comments.

- The CREATE syntax for hash partitioning adds two new keywords, and ideally
we should try to avoid adding keywords. Also, I can see that the HASH
keyword has been added, but I don't see any use of the newly added
keyword in gram.y.

- Also, I didn't like the idea of fixing the number of partitions in the
CREATE TABLE syntax. That's something that needs to be changeable.

> * Todo / discussions
>
> In this patch, we cannot change the number of partitions specified
> by PARTITIONS. I we can change this, the partitioning rule
> (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> is also changed and then we need reallocatiing records between
> partitions.
>
> In this patch, user can specify a hash function USING. However,
> we migth need default hash functions which are useful and
> proper for hash partitioning.

- With a fixed default hash function, and without specifying the number of
partitions during CREATE TABLE, we don't need the two additional columns
in the pg_partitioned_table catalog.

> Currently, even when we issue SELECT query with a condition,
> postgres looks into all partitions regardless of each partition's
> constraint, because this is complicated such like "abs(hashint4(i))%3 = 0".
>
> postgres=# explain select * from h where i = 10;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Append  (cost=0.00..125.62 rows=40 width=4)
>    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
>          Filter: (i = 10)
>    ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
>          Filter: (i = 10)
>    ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
>          Filter: (i = 10)
>    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
>          Filter: (i = 10)
> (9 rows)
>
> However, if we modify a condition into a same expression
> as the partitions constraint, postgres can exclude unrelated
> table from search targets. So, we might avoid the problem
> by converting the qual properly before calling predicate_refuted_by().
>
> postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Append  (cost=0.00..61.00 rows=14 width=4)
>    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
>          Filter: ((abs(hashint4(i)) % 3) = 2)
>    ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
>          Filter: ((abs(hashint4(i)) % 3) = 2)
> (5 rows)
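
The qual conversion described above amounts to evaluating the partition index
for the constant once and keeping only the matching partition. A rough
standalone C sketch of that idea follows (it is not planner code). The names
demo_hash_int4, hash_partition_index and prune_for_equality are made up for
this example, and the hash is only a stand-in that does not reproduce
hashint4's values. The sketch also takes the modulo of the unsigned hash
instead of calling abs(), on the assumption that abs() has no valid int4
result if a hash function ever returns the minimum int4 value.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Stand-in for a real hash function such as hashint4.  This bit mixer is
 * purely illustrative and does NOT produce PostgreSQL's hash values.
 */
uint32_t
demo_hash_int4(int32_t key)
{
    uint32_t h = (uint32_t) key;

    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return h;
}

/*
 * Partition index for a key.  Taking the modulo of the unsigned hash
 * avoids abs(), which cannot represent the absolute value of INT32_MIN.
 */
int
hash_partition_index(int32_t key, int nparts)
{
    return (int) (demo_hash_int4(key) % (uint32_t) nparts);
}

/*
 * For a qual of the form "key = constant", only one partition can hold
 * matching rows.  Set keep[i] for that partition only and return its index.
 */
int
prune_for_equality(int32_t constant, int nparts, bool *keep)
{
    int target = hash_partition_index(constant, nparts);

    for (int i = 0; i < nparts; i++)
        keep[i] = (i == target);
    return target;
}
```

With three partitions, prune_for_equality(10, 3, keep) leaves exactly one entry
of keep[] set, which is the effect the hand-rewritten query achieves in the
second plan.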

> Best regards,
> Yugo Nagata
>
> --
> Yugo Nagata <nagata@sraoss.co.jp>


Regards,

Rushabh Lathia

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

01 March 2017, 08:22:58

On Tue, Feb 28, 2017 at 8:03 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> Hi all,
>
> Now we have a declarative partitioning, but hash partitioning is not
> implemented yet. Attached is a POC patch to add the hash partitioning
> feature. I know we will need more discussions about the syntax and other
> specifications before going ahead the project, but I think this runnable
> code might help to discuss what and how we implement this.
>

Great.

> * Description
>
> In this patch, the hash partitioning implementation is basically based
> on the list partitioning mechanism. However, partition bounds cannot be
> specified explicitly, but this is used internally as hash partition
> index, which is calculated when a partition is created or attached.
>
> The tentative syntax to create a partitioned table is as bellow;
>
> CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
>
> The number of partitions is specified by PARTITIONS, which is currently
> constant and cannot be changed, but I think this is needed to be changed in
> some manner. A hash function is specified by USING. Maybe, specifying hash
> function may be ommitted, and in this case, a default hash function
> corresponding to key type will be used.
>
> A partition table can be create as bellow;
>
> CREATE TABLE h1 PARTITION OF h;
> CREATE TABLE h2 PARTITION OF h;
> CREATE TABLE h3 PARTITION OF h;
>
> FOR VALUES clause cannot be used, and the partition bound is
> calclulated automatically as partition index of single integer value.
>
> When trying create partitions more than the number specified
> by PARTITIONS, it gets an error.
>
> postgres=# create table h4 partition of h;
> ERROR: cannot create hash partition more than 3 for h
>
> An inserted record is stored in a partition whose index equals
> abs(hashfunc(key)) % <number_of_partitions>. In the above
> example, this is abs(hashint4(i))%3.
>
> postgres=# insert into h (select generate_series(0,20));
> INSERT 0 21
>
> postgres=# select *,tableoid::regclass from h;
> i | tableoid
> ----+----------
> 0 | h1
> 1 | h1
> 2 | h1
> 4 | h1
> 8 | h1
> 10 | h1
> 11 | h1
> 14 | h1
> 15 | h1
> 17 | h1
> 20 | h1
> 5 | h2
> 12 | h2
> 13 | h2
> 16 | h2
> 19 | h2
> 3 | h3
> 6 | h3
> 7 | h3
> 9 | h3
> 18 | h3
> (21 rows)
>
> * Todo / discussions
>
> In this patch, we cannot change the number of partitions specified
> by PARTITIONS. I we can change this, the partitioning rule
> (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> is also changed and then we need reallocatiing records between
> partitions.
>
> In this patch, user can specify a hash function USING. However,
> we migth need default hash functions which are useful and
> proper for hash partitioning.
>

IMHO, we should try to keep the create partition syntax simple and aligned
with the other partition strategies. For example:

CREATE TABLE h (i int) PARTITION BY HASH(i);

I agree that specifying the number of partitions is unavoidable with modulo
hashing, but we can avoid it with another hashing technique. Have you thought
about linear hashing[1] or consistent hashing[2]? This would allow us to
add/drop partitions with minimal row movement.

+1 for the default hash function corresponding to partitioning key type.
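
To make the row-movement point concrete, here is a rough standalone C sketch
(not PostgreSQL code) comparing plain modulo placement with linear hashing
when a fourth partition is added to three. All function names are made up for
this example, and the hash values are simply taken to be 0 .. nkeys-1 so the
counts are easy to check by hand.

```c
#include <stdint.h>

/* Plain modulo placement: partition = hash % nparts. */
unsigned
modulo_partition(uint32_t hash, unsigned nparts)
{
    return hash % nparts;
}

/*
 * Linear hashing placement.  n0 is the initial number of partitions,
 * level is the number of completed doubling rounds, and split is the
 * next partition to be split.  Partitions below split have already been
 * split in this round, so they use the next round's modulus.
 */
unsigned
linear_partition(uint32_t hash, unsigned n0, unsigned level, unsigned split)
{
    unsigned addr = hash % (n0 << level);

    if (addr < split)
        addr = hash % (n0 << (level + 1));
    return addr;
}

/* Count hash values 0..nkeys-1 that change partition under modulo. */
unsigned
count_moved_modulo(uint32_t nkeys, unsigned old_n, unsigned new_n)
{
    unsigned moved = 0;

    for (uint32_t h = 0; h < nkeys; h++)
        if (modulo_partition(h, old_n) != modulo_partition(h, new_n))
            moved++;
    return moved;
}

/*
 * Count hash values 0..nkeys-1 that change partition when linear hashing
 * adds one partition to n0 by splitting partition 0 (split goes 0 -> 1).
 */
unsigned
count_moved_linear(uint32_t nkeys, unsigned n0)
{
    unsigned moved = 0;

    for (uint32_t h = 0; h < nkeys; h++)
        if (linear_partition(h, n0, 0, 0) != linear_partition(h, n0, 0, 1))
            moved++;
    return moved;
}
```

For 12 consecutive hash values, going from 3 to 4 partitions moves 9 of them
under modulo placement but only 2 under the linear split, because only the
rows of the partition being split are rehashed.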

Regards,

Amul

[1] https://en.wikipedia.org/wiki/Linear_hashing

[2] https://en.wikipedia.org/wiki/Consistent_hashing

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

01 March 2017, 12:10:10

Hi Aleksander,

On Tue, 28 Feb 2017 18:05:36 +0300
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:

> Hi, Yugo.
> 
> Looks like a great feature! I'm going to take a closer look on your code
> and write a feedback shortly. For now I can only tell that you forgot
> to include some documentation in the patch.

Thank you for looking into it. I'm looking forward to your feedback.
This is a proof of concept patch and additional documentation
is not included. I'll add this after reaching a consensus
on the specification of the feature.

> 
> I've added a corresponding entry to current commitfest [1]. Hope you
> don't mind. If it's not too much trouble could you please register on a
> commitfest site and add yourself to this entry as an author? I'm pretty
> sure someone is using this information for writing release notes or
> something like this.

Thank you for registering it to the commitfest. I have added myself as an author.

> 
> [1] https://commitfest.postgresql.org/13/1059/
> 
> On Tue, Feb 28, 2017 at 11:33:13PM +0900, Yugo Nagata wrote:
> > Hi all,
> > 
> > Now we have a declarative partitioning, but hash partitioning is not
> > implemented yet. Attached is a POC patch to add the hash partitioning
> > feature. I know we will need more discussions about the syntax and other
> > specifications before going ahead the project, but I think this runnable
> > code might help to discuss what and how we implement this.
> > 
> > * Description
> > 
> > In this patch, the hash partitioning implementation is basically based
> > on the list partitioning mechanism. However, partition bounds cannot be
> > specified explicitly, but this is used internally as hash partition
> > index, which is calculated when a partition is created or attached.
> > 
> > The tentative syntax to create a partitioned table is as bellow;
> > 
> >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> > 
> > The number of partitions is specified by PARTITIONS, which is currently
> > constant and cannot be changed, but I think this is needed to be changed in
> > some manner. A hash function is specified by USING. Maybe, specifying hash
> > function may be ommitted, and in this case, a default hash function
> > corresponding to key type will be used.
> > 
> > A partition table can be create as bellow;
> > 
> >  CREATE TABLE h1 PARTITION OF h;
> >  CREATE TABLE h2 PARTITION OF h;
> >  CREATE TABLE h3 PARTITION OF h;
> > 
> > FOR VALUES clause cannot be used, and the partition bound is
> > calclulated automatically as partition index of single integer value.
> > 
> > When trying create partitions more than the number specified
> > by PARTITIONS, it gets an error.
> > 
> > postgres=# create table h4 partition of h;
> > ERROR:  cannot create hash partition more than 3 for h
> > 
> > An inserted record is stored in a partition whose index equals
> > abs(hashfunc(key)) % <number_of_partitions>. In the above
> > example, this is abs(hashint4(i))%3.
> > 
> > postgres=# insert into h (select generate_series(0,20));
> > INSERT 0 21
> > 
> > postgres=# select *,tableoid::regclass from h;
> >  i  | tableoid 
> > ----+----------
> >   0 | h1
> >   1 | h1
> >   2 | h1
> >   4 | h1
> >   8 | h1
> >  10 | h1
> >  11 | h1
> >  14 | h1
> >  15 | h1
> >  17 | h1
> >  20 | h1
> >   5 | h2
> >  12 | h2
> >  13 | h2
> >  16 | h2
> >  19 | h2
> >   3 | h3
> >   6 | h3
> >   7 | h3
> >   9 | h3
> >  18 | h3
> > (21 rows)
> > 
> > * Todo / discussions
> > 
> > In this patch, we cannot change the number of partitions specified
> > by PARTITIONS. I we can change this, the partitioning rule
> > (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> > is also changed and then we need reallocatiing records between
> > partitions.
> > 
> > In this patch, user can specify a hash function USING. However,
> > we migth need default hash functions which are useful and
> > proper for hash partitioning. 
> > 
> > Currently, even when we issue SELECT query with a condition,
> > postgres looks into all partitions regardless of each partition's
> > constraint, because this is complicated such like "abs(hashint4(i))%3 = 0".
> > 
> > postgres=# explain select * from h where i = 10;
> >                         QUERY PLAN                        
> > ----------------------------------------------------------
> >  Append  (cost=0.00..125.62 rows=40 width=4)
> >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> > (9 rows)
> > 
> > However, if we modify a condition into a same expression
> > as the partitions constraint, postgres can exclude unrelated
> > table from search targets. So, we might avoid the problem
> > by converting the qual properly before calling predicate_refuted_by().
> > 
> > postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
> >                         QUERY PLAN                        
> > ----------------------------------------------------------
> >  Append  (cost=0.00..61.00 rows=14 width=4)
> >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> >          Filter: ((abs(hashint4(i)) % 3) = 2)
> >    ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
> >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > (5 rows)
> > 
> > Best regards,
> > Yugo Nagata
> > 
> > -- 
> > Yugo Nagata <nagata@sraoss.co.jp>
> 
> > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> > index 41c0056..3820920 100644
> > --- a/src/backend/catalog/heap.c
> > +++ b/src/backend/catalog/heap.c
> > @@ -3074,7 +3074,7 @@ StorePartitionKey(Relation rel,
> >                    AttrNumber *partattrs,
> >                    List *partexprs,
> >                    Oid *partopclass,
> > -                  Oid *partcollation)
> > +                  Oid *partcollation, int16 partnparts, Oid hashfunc)
> >  {
> >      int            i;
> >      int2vector *partattrs_vec;
> > @@ -3121,6 +3121,8 @@ StorePartitionKey(Relation rel,
> >      values[Anum_pg_partitioned_table_partrelid - 1] = ObjectIdGetDatum(RelationGetRelid(rel));
> >      values[Anum_pg_partitioned_table_partstrat - 1] = CharGetDatum(strategy);
> >      values[Anum_pg_partitioned_table_partnatts - 1] = Int16GetDatum(partnatts);
> > +    values[Anum_pg_partitioned_table_partnparts - 1] = Int16GetDatum(partnparts);
> > +    values[Anum_pg_partitioned_table_parthashfunc - 1] = ObjectIdGetDatum(hashfunc);
> >      values[Anum_pg_partitioned_table_partattrs - 1] = PointerGetDatum(partattrs_vec);
> >      values[Anum_pg_partitioned_table_partclass - 1] = PointerGetDatum(partopclass_vec);
> >      values[Anum_pg_partitioned_table_partcollation - 1] = PointerGetDatum(partcollation_vec);
> > diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> > index 4bcef58..24e69c6 100644
> > --- a/src/backend/catalog/partition.c
> > +++ b/src/backend/catalog/partition.c
> > @@ -36,6 +36,8 @@
> >  #include "optimizer/clauses.h"
> >  #include "optimizer/planmain.h"
> >  #include "optimizer/var.h"
> > +#include "parser/parse_func.h"
> > +#include "parser/parse_oper.h"
> >  #include "rewrite/rewriteManip.h"
> >  #include "storage/lmgr.h"
> >  #include "utils/array.h"
> > @@ -120,6 +122,7 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
> >  
> >  static List *get_qual_for_list(PartitionKey key, PartitionBoundSpec *spec);
> >  static List *get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec);
> > +static List *get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec);
> >  static Oid get_partition_operator(PartitionKey key, int col,
> >                         StrategyNumber strategy, bool *need_relabel);
> >  static List *generate_partition_qual(Relation rel);
> > @@ -236,7 +239,8 @@ RelationBuildPartitionDesc(Relation rel)
> >              oids[i++] = lfirst_oid(cell);
> >
> >          /* Convert from node to the internal representation */
> > -        if (key->strategy == PARTITION_STRATEGY_LIST)
> > +        if (key->strategy == PARTITION_STRATEGY_LIST ||
> > +            key->strategy == PARTITION_STRATEGY_HASH)
> >          {
> >              List       *non_null_values = NIL;
> >  
> > @@ -251,7 +255,7 @@ RelationBuildPartitionDesc(Relation rel)
> >                  ListCell   *c;
> >                  PartitionBoundSpec *spec = lfirst(cell);
> >  
> > -                if (spec->strategy != PARTITION_STRATEGY_LIST)
> > +                if (spec->strategy != key->strategy)
> >                      elog(ERROR, "invalid strategy in partition bound spec");
> >  
> >                  foreach(c, spec->listdatums)
> > @@ -464,6 +468,7 @@ RelationBuildPartitionDesc(Relation rel)
> >          switch (key->strategy)
> >          {
> >              case PARTITION_STRATEGY_LIST:
> > +            case PARTITION_STRATEGY_HASH:
> >                  {
> >                      boundinfo->has_null = found_null;
> >                      boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
> > @@ -829,6 +834,18 @@ check_new_partition_bound(char *relname, Relation parent, Node *bound)
> >                  break;
> >              }
> >  
> > +        case PARTITION_STRATEGY_HASH:
> > +            {
> > +                Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> > +
> > +                if (partdesc->nparts + 1 > key->partnparts)
> > +                    ereport(ERROR,
> > +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > +                    errmsg("cannot create hash partition more than %d for %s",
> > +                            key->partnparts, RelationGetRelationName(parent))));
> > +                break;
> > +            }
> > +
> >          default:
> >              elog(ERROR, "unexpected partition strategy: %d",
> >                   (int) key->strategy);
> > @@ -916,6 +933,11 @@ get_qual_from_partbound(Relation rel, Relation parent, Node *bound)
> >              my_qual = get_qual_for_range(key, spec);
> >              break;
> >  
> > +        case PARTITION_STRATEGY_HASH:
> > +            Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> > +            my_qual = get_qual_for_hash(key, spec);
> > +            break;
> > +
> >          default:
> >              elog(ERROR, "unexpected partition strategy: %d",
> >                   (int) key->strategy);
> > @@ -1146,6 +1168,84 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
> >      return pd;
> >  }
> >  
> > +/*
> > + * convert_expr_for_hash
> > + *
> > + * Converts an expr for a hash partition's constraint.
> > + * expr is converted into "abs(hashfunc(expr)) % npart".
> > + *
> > + * npart: number of partitions
> > + * hashfunc: OID of hash function
> > + */
> > +Expr *
> > +convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc)
> > +{
> > +    FuncExpr   *func,
> > +               *abs;
> > +    Expr        *modexpr;
> > +    Oid            modoid;
> > +    Oid            int4oid[1] = {INT4OID};
> > +
> > +    ParseState *pstate = make_parsestate(NULL);
> > +    Value       *val_npart = makeInteger(npart);
> > +    Node       *const_npart = (Node *) make_const(pstate, val_npart, -1);
> > +
> > +    /* hash function */
> > +    func = makeFuncExpr(hashfunc,
> > +                        INT4OID,
> > +                        list_make1(expr),
> > +                        0,
> > +                        0,
> > +                        COERCE_EXPLICIT_CALL);
> > +
> > +    /* Abs */
> > +    abs = makeFuncExpr(LookupFuncName(list_make1(makeString("abs")), 1, int4oid, false),
> > +                       INT4OID,
> > +                       list_make1(func),
> > +                       0,
> > +                       0,
> > +                       COERCE_EXPLICIT_CALL);
> > +
> > +    /* modulo by npart */
> > +    modoid = LookupOperName(pstate, list_make1(makeString("%")), INT4OID, INT4OID, false, -1);
> > +    modexpr = make_opclause(modoid, INT4OID, false, (Expr*)abs, (Expr*)const_npart, 0, 0);
> > +
> > +    return modexpr;
> > +}
> > +
> > +
> > +/*
> > + * get_next_hash_partition_index
> > + *
> > + * Returns the minimal index which is not used for hash partition.
> > + */
> > +int
> > +get_next_hash_partition_index(Relation parent)
> > +{
> > +    PartitionKey key = RelationGetPartitionKey(parent);
> > +    PartitionDesc partdesc = RelationGetPartitionDesc(parent);
> > +
> > +    int      i;
> > +    bool *used = palloc0(sizeof(int) * key->partnparts);
> > +
> > +    /* mark indexes used by existing partitions */
> > +    for (i = 0; i < partdesc->boundinfo->ndatums; i++)
> > +    {
> > +        Datum* datum = partdesc->boundinfo->datums[i];
> > +        int idx = DatumGetInt16(datum[0]);
> > +
> > +        if (!used[idx])
> > +            used[idx] = true;
> > +    }
> > +
> > +    /* find the minimal unused index */
> > +    for (i = 0; i < key->partnparts; i++)
> > +        if (!used[i])
> > +            break;
> > +
> > +    return i;
> > +}
> > +
> >  /* Module-local functions */
> >  
> >  /*
> > @@ -1467,6 +1567,43 @@ get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec)
> >  }
> >  
> >  /*
> > + * get_qual_for_hash
> > + *
> > + * Returns a list of expressions to use as a hash partition's constraint.
> > + */
> > +static List *
> > +get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
> > +{
> > +    List       *result;
> > +    Expr       *keyCol;
> > +    Expr       *expr;
> > +    Expr        *opexpr;
> > +    Oid            operoid;
> > +    ParseState *pstate = make_parsestate(NULL);
> > +
> > +    /* Left operand */
> > +    if (key->partattrs[0] != 0)
> > +        keyCol = (Expr *) makeVar(1,
> > +                                  key->partattrs[0],
> > +                                  key->parttypid[0],
> > +                                  key->parttypmod[0],
> > +                                  key->parttypcoll[0],
> > +                                  0);
> > +    else
> > +        keyCol = (Expr *) copyObject(linitial(key->partexprs));
> > +
> > +    expr = convert_expr_for_hash(keyCol, key->partnparts, key->parthashfunc);
> > +
> > +    /* equals the listdatums value */
> > +    operoid = LookupOperName(pstate, list_make1(makeString("=")), INT4OID, INT4OID, false, -1);
> > +    opexpr = make_opclause(operoid, BOOLOID, false, expr, linitial(spec->listdatums), 0, 0);
> > +
> > +    result = list_make1(opexpr);
> > +
> > +    return result;
> > +}
> > +
> > +/*
> >   * get_partition_operator
> >   *
> >   * Return oid of the operator of given strategy for a given partition key
> > @@ -1730,6 +1867,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
> >                              (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
> >                          errmsg("range partition key of row contains null")));
> >          }
> > +        else if (key->strategy == PARTITION_STRATEGY_HASH)
> > +        {
> > +            values[0] = OidFunctionCall1(key->parthashfunc, values[0]);
> > +            values[0] = Int16GetDatum(Abs(DatumGetInt16(values[0])) % key->partnparts);
> > +        }
> >  
> >          if (partdesc->boundinfo->has_null && isnull[0])
> >              /* Tuple maps to the null-accepting list partition */
> > @@ -1744,6 +1886,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
> >              switch (key->strategy)
> >              {
> >                  case PARTITION_STRATEGY_LIST:
> > +                case PARTITION_STRATEGY_HASH:
> >                      if (cur_offset >= 0 && equal)
> >                          cur_index = partdesc->boundinfo->indexes[cur_offset];
> >                      else
> > @@ -1968,6 +2111,7 @@ partition_bound_cmp(PartitionKey key, PartitionBoundInfo boundinfo,
> >      switch (key->strategy)
> >      {
> >          case PARTITION_STRATEGY_LIST:
> > +        case PARTITION_STRATEGY_HASH:
> >              cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
> >                                                       key->partcollation[0],
> >                                                       bound_datums[0],
> > diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> > index 3cea220..5a28cc0 100644
> > --- a/src/backend/commands/tablecmds.c
> > +++ b/src/backend/commands/tablecmds.c
> > @@ -41,6 +41,7 @@
> >  #include "catalog/pg_inherits_fn.h"
> >  #include "catalog/pg_namespace.h"
> >  #include "catalog/pg_opclass.h"
> > +#include "catalog/pg_proc.h"
> >  #include "catalog/pg_tablespace.h"
> >  #include "catalog/pg_trigger.h"
> >  #include "catalog/pg_type.h"
> > @@ -77,6 +78,7 @@
> >  #include "parser/parse_oper.h"
> >  #include "parser/parse_relation.h"
> >  #include "parser/parse_type.h"
> > +#include "parser/parse_func.h"
> >  #include "parser/parse_utilcmd.h"
> >  #include "parser/parser.h"
> >  #include "pgstat.h"
> > @@ -450,7 +452,7 @@ static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
> >                                   Oid oldrelid, void *arg);
> >  static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
> >  static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
> > -static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > +static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> >                        List **partexprs, Oid *partopclass, Oid *partcollation);
> >  static void CreateInheritance(Relation child_rel, Relation parent_rel);
> >  static void RemoveInheritance(Relation child_rel, Relation parent_rel);
> > @@ -799,8 +801,10 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> >          AttrNumber    partattrs[PARTITION_MAX_KEYS];
> >          Oid            partopclass[PARTITION_MAX_KEYS];
> >          Oid            partcollation[PARTITION_MAX_KEYS];
> > +        Oid            partatttypes[PARTITION_MAX_KEYS];
> >          List       *partexprs = NIL;
> >          List       *cmds = NIL;
> > +        Oid hashfuncOid = InvalidOid;
> >  
> >          /*
> >           * We need to transform the raw parsetrees corresponding to partition
> > @@ -811,15 +815,40 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> >          stmt->partspec = transformPartitionSpec(rel, stmt->partspec,
> >                                                  &strategy);
> >          ComputePartitionAttrs(rel, stmt->partspec->partParams,
> > -                              partattrs, &partexprs, partopclass,
> > +                              partattrs, partatttypes, &partexprs, partopclass,
> >                                partcollation);
> >  
> >          partnatts = list_length(stmt->partspec->partParams);
> > +
> > +        if (strategy == PARTITION_STRATEGY_HASH)
> > +        {
> > +            Oid funcrettype;
> > +
> > +            if (partnatts != 1)
> > +                ereport(ERROR,
> > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                        errmsg("number of partition key must be 1 for hash partition")));
> > +
> > +            hashfuncOid = LookupFuncName(stmt->partspec->hashfunc, 1, partatttypes, false);
> > +            funcrettype = get_func_rettype(hashfuncOid);
> > +            if (funcrettype != INT4OID)
> > +                ereport(ERROR,
> > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                        errmsg("hash function for partitioning must return integer")));
> > +
> > +            if (func_volatile(hashfuncOid) != PROVOLATILE_IMMUTABLE)
> > +                ereport(ERROR,
> > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                        errmsg("hash function for partitioning must be marked IMMUTABLE")));
> > +
> > +        }
> > +
> >          StorePartitionKey(rel, strategy, partnatts, partattrs, partexprs,
> > -                          partopclass, partcollation);
> > +                          partopclass, partcollation, stmt->partspec->partnparts, hashfuncOid);
> >  
> > -        /* Force key columns to be NOT NULL when using range partitioning */
> > -        if (strategy == PARTITION_STRATEGY_RANGE)
> > +        /* Force key columns to be NOT NULL when using range or hash partitioning */
> > +        if (strategy == PARTITION_STRATEGY_RANGE ||
> > +            strategy == PARTITION_STRATEGY_HASH)
> >          {
> >              for (i = 0; i < partnatts; i++)
> >              {
> > @@ -12783,18 +12812,51 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> >      newspec->strategy = partspec->strategy;
> >      newspec->location = partspec->location;
> >      newspec->partParams = NIL;
> > +    newspec->partnparts = partspec->partnparts;
> > +    newspec->hashfunc = partspec->hashfunc;
> >  
> >      /* Parse partitioning strategy name */
> >      if (!pg_strcasecmp(partspec->strategy, "list"))
> >          *strategy = PARTITION_STRATEGY_LIST;
> >      else if (!pg_strcasecmp(partspec->strategy, "range"))
> >          *strategy = PARTITION_STRATEGY_RANGE;
> > +    else if (!pg_strcasecmp(partspec->strategy, "hash"))
> > +        *strategy = PARTITION_STRATEGY_HASH;
> >      else
> >          ereport(ERROR,
> >                  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> >                   errmsg("unrecognized partitioning strategy \"%s\"",
> >                          partspec->strategy)));
> >  
> > +    if (*strategy == PARTITION_STRATEGY_HASH)
> > +    {
> > +        if (partspec->partnparts < 0)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                     errmsg("number of partitions must be specified for hash partition")));
> > +        else if (partspec->partnparts == 0)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                     errmsg("number of partitions must be greater than 0")));
> > +
> > +        if (list_length(partspec->hashfunc) == 0)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                     errmsg("hash function must be specified for hash partition")));
> > +    }
> > +    else
> > +    {
> > +        if (partspec->partnparts >= 0)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                     errmsg("number of partitions can be specified only for hash partition")));
> > +
> > +        if (list_length(partspec->hashfunc) > 0)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                     errmsg("hash function can be specified only for hash partition")));
> > +    }
> > +
> >      /*
> >       * Create a dummy ParseState and insert the target relation as its sole
> >       * rangetable entry.  We need a ParseState for transformExpr.
> > @@ -12843,7 +12905,7 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> >   * Compute per-partition-column information from a list of PartitionElem's
> >   */
> >  static void
> > -ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > +ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> >                        List **partexprs, Oid *partopclass, Oid *partcollation)
> >  {
> >      int            attn;
> > @@ -13010,6 +13072,7 @@ ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> >                                                 "btree",
> >                                                 BTREE_AM_OID);
> >  
> > +        partatttypes[attn] = atttype;
> >          attn++;
> >      }
> >  }
> > diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> > index 05d8538..f4febc9 100644
> > --- a/src/backend/nodes/copyfuncs.c
> > +++ b/src/backend/nodes/copyfuncs.c
> > @@ -4232,6 +4232,8 @@ _copyPartitionSpec(const PartitionSpec *from)
> >  
> >      COPY_STRING_FIELD(strategy);
> >      COPY_NODE_FIELD(partParams);
> > +    COPY_SCALAR_FIELD(partnparts);
> > +    COPY_NODE_FIELD(hashfunc);
> >      COPY_LOCATION_FIELD(location);
> >  
> >      return newnode;
> > diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
> > index d595cd7..d589eac 100644
> > --- a/src/backend/nodes/equalfuncs.c
> > +++ b/src/backend/nodes/equalfuncs.c
> > @@ -2725,6 +2725,8 @@ _equalPartitionSpec(const PartitionSpec *a, const PartitionSpec *b)
> >  {
> >      COMPARE_STRING_FIELD(strategy);
> >      COMPARE_NODE_FIELD(partParams);
> > +    COMPARE_SCALAR_FIELD(partnparts);
> > +    COMPARE_NODE_FIELD(hashfunc);
> >      COMPARE_LOCATION_FIELD(location);
> >  
> >      return true;
> > diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> > index b3802b4..d6db80e 100644
> > --- a/src/backend/nodes/outfuncs.c
> > +++ b/src/backend/nodes/outfuncs.c
> > @@ -3318,6 +3318,8 @@ _outPartitionSpec(StringInfo str, const PartitionSpec *node)
> >  
> >      WRITE_STRING_FIELD(strategy);
> >      WRITE_NODE_FIELD(partParams);
> > +    WRITE_INT_FIELD(partnparts);
> > +    WRITE_NODE_FIELD(hashfunc);
> >      WRITE_LOCATION_FIELD(location);
> >  }
> >  
> > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> > index e833b2e..b67140d 100644
> > --- a/src/backend/parser/gram.y
> > +++ b/src/backend/parser/gram.y
> > @@ -574,6 +574,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> >  %type <list>        partbound_datum_list
> >  %type <partrange_datum>    PartitionRangeDatum
> >  %type <list>        range_datum_list
> > +%type <ival>        hash_partitions
> > +%type <list>        hash_function
> >  
> >  /*
> >   * Non-keyword token types.  These are hard-wired into the "flex" lexer.
> > @@ -627,7 +629,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> >  
> >      GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING
> >  
> > -    HANDLER HAVING HEADER_P HOLD HOUR_P
> > +    HANDLER HASH HAVING HEADER_P HOLD HOUR_P
> >  
> >      IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
> >      INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
> > @@ -651,7 +653,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> >      OBJECT_P OF OFF OFFSET OIDS OLD ON ONLY OPERATOR OPTION OPTIONS OR
> >      ORDER ORDINALITY OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER
> >  
> > -    PARALLEL PARSER PARTIAL PARTITION PASSING PASSWORD PLACING PLANS POLICY
> > +    PARALLEL PARSER PARTIAL PARTITION PARTITIONS PASSING PASSWORD PLACING PLANS POLICY
> >      POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY
> >      PRIOR PRIVILEGES PROCEDURAL PROCEDURE PROGRAM PUBLICATION
> >  
> > @@ -2587,6 +2589,16 @@ ForValues:
> >  
> >                      $$ = (Node *) n;
> >                  }
> > +
> > +            /* a HASH partition */
> > +            | /*EMPTY*/
> > +                {
> > +                    PartitionBoundSpec *n = makeNode(PartitionBoundSpec);
> > +
> > +                    n->strategy = PARTITION_STRATEGY_HASH;
> > +
> > +                    $$ = (Node *) n;
> > +                }
> >          ;
> >  
> >  partbound_datum:
> > @@ -3666,7 +3678,7 @@ OptPartitionSpec: PartitionSpec    { $$ = $1; }
> >              | /*EMPTY*/            { $$ = NULL; }
> >          ;
> >  
> > -PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> > +PartitionSpec: PARTITION BY part_strategy '(' part_params ')' hash_partitions hash_function
> >                  {
> >                      PartitionSpec *n = makeNode(PartitionSpec);
> >  
> > @@ -3674,10 +3686,21 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> >                      n->partParams = $5;
> >                      n->location = @1;
> >  
> > +                    n->partnparts = $7;
> > +                    n->hashfunc = $8;
> > +
> >                      $$ = n;
> >                  }
> >          ;
> >  
> > +hash_partitions: PARTITIONS Iconst { $$ = $2; }
> > +                    | /*EMPTY*/   { $$ = -1; }
> > +        ;
> > +
> > +hash_function: USING handler_name { $$ = $2; }
> > +                    | /*EMPTY*/ { $$ = NULL; }
> > +        ;
> > +
> >  part_strategy:    IDENT                    { $$ = $1; }
> >                  | unreserved_keyword    { $$ = pstrdup($1); }
> >          ;
> > @@ -14377,6 +14400,7 @@ unreserved_keyword:
> >              | GLOBAL
> >              | GRANTED
> >              | HANDLER
> > +            | HASH
> >              | HEADER_P
> >              | HOLD
> >              | HOUR_P
> > @@ -14448,6 +14472,7 @@ unreserved_keyword:
> >              | PARSER
> >              | PARTIAL
> >              | PARTITION
> > +            | PARTITIONS
> >              | PASSING
> >              | PASSWORD
> >              | PLANS
> > diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
> > index ff2bab6..8e1be31 100644
> > --- a/src/backend/parser/parse_utilcmd.c
> > +++ b/src/backend/parser/parse_utilcmd.c
> > @@ -40,6 +40,7 @@
> >  #include "catalog/pg_opclass.h"
> >  #include "catalog/pg_operator.h"
> >  #include "catalog/pg_type.h"
> > +#include "catalog/partition.h"
> >  #include "commands/comment.h"
> >  #include "commands/defrem.h"
> >  #include "commands/tablecmds.h"
> > @@ -3252,6 +3253,24 @@ transformPartitionBound(ParseState *pstate, Relation parent, Node *bound)
> >              ++i;
> >          }
> >      }
> > +    else if (strategy == PARTITION_STRATEGY_HASH)
> > +    {
> > +        Value     *conval;
> > +        Node        *value;
> > +        int          index;
> > +
> > +        if (spec->strategy != PARTITION_STRATEGY_HASH)
> > +            ereport(ERROR,
> > +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> > +                 errmsg("invalid bound specification for a hash partition")));
> > +
> > +        index = get_next_hash_partition_index(parent);
> > +
> > +        /* store the partition index as a listdatums value */
> > +        conval = makeInteger(index);
> > +        value = (Node *) make_const(pstate, conval, -1);
> > +        result_spec->listdatums = list_make1(value);
> > +    }
> >      else
> >          elog(ERROR, "unexpected partition strategy: %d", (int) strategy);
> >  
> > diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
> > index b27b77d..fab6eea 100644
> > --- a/src/backend/utils/adt/ruleutils.c
> > +++ b/src/backend/utils/adt/ruleutils.c
> > @@ -1423,7 +1423,7 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
> >   *
> >   * Returns the partition key specification, ie, the following:
> >   *
> > - * PARTITION BY { RANGE | LIST } (column opt_collation opt_opclass [, ...])
> > + * PARTITION BY { RANGE | LIST | HASH } (column opt_collation opt_opclass [, ...])
> >   */
> >  Datum
> >  pg_get_partkeydef(PG_FUNCTION_ARGS)
> > @@ -1513,6 +1513,9 @@ pg_get_partkeydef_worker(Oid relid, int prettyFlags)
> >          case PARTITION_STRATEGY_RANGE:
> >              appendStringInfo(&buf, "RANGE");
> >              break;
> > +        case PARTITION_STRATEGY_HASH:
> > +            appendStringInfo(&buf, "HASH");
> > +            break;
> >          default:
> >              elog(ERROR, "unexpected partition strategy: %d",
> >                   (int) form->partstrat);
> > @@ -8520,6 +8523,9 @@ get_rule_expr(Node *node, deparse_context *context,
> >                          appendStringInfoString(buf, ")");
> >                          break;
> >  
> > +                    case PARTITION_STRATEGY_HASH:
> > +                        break;
> > +
> >                      default:
> >                          elog(ERROR, "unrecognized partition strategy: %d",
> >                               (int) spec->strategy);
> > diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
> > index 9001e20..829e4d2 100644
> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> > @@ -855,6 +855,9 @@ RelationBuildPartitionKey(Relation relation)
> >      key->strategy = form->partstrat;
> >      key->partnatts = form->partnatts;
> >  
> > +    key->partnparts = form->partnparts;
> > +    key->parthashfunc = form->parthashfunc;
> > +
> >      /*
> >       * We can rely on the first variable-length attribute being mapped to the
> >       * relevant field of the catalog's C struct, because all previous
> > @@ -999,6 +1002,9 @@ copy_partition_key(PartitionKey fromkey)
> >      newkey->strategy = fromkey->strategy;
> >      newkey->partnatts = n = fromkey->partnatts;
> >  
> > +    newkey->partnparts = fromkey->partnparts;
> > +    newkey->parthashfunc = fromkey->parthashfunc;
> > +
> >      newkey->partattrs = (AttrNumber *) palloc(n * sizeof(AttrNumber));
> >      memcpy(newkey->partattrs, fromkey->partattrs, n * sizeof(AttrNumber));
> >  
> > diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
> > index 1187797..367e2f8 100644
> > --- a/src/include/catalog/heap.h
> > +++ b/src/include/catalog/heap.h
> > @@ -141,7 +141,7 @@ extern void StorePartitionKey(Relation rel,
> >                    AttrNumber *partattrs,
> >                    List *partexprs,
> >                    Oid *partopclass,
> > -                  Oid *partcollation);
> > +                  Oid *partcollation, int16 partnparts, Oid hashfunc);
> >  extern void RemovePartitionKeyByRelId(Oid relid);
> >  extern void StorePartitionBound(Relation rel, Relation parent, Node *bound);
> >  
> > diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
> > index b195d1a..80f4b0e 100644
> > --- a/src/include/catalog/partition.h
> > +++ b/src/include/catalog/partition.h
> > @@ -89,4 +89,6 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
> >                          TupleTableSlot *slot,
> >                          EState *estate,
> >                          Oid *failed_at);
> > +extern Expr *convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc);
> > +extern int get_next_hash_partition_index(Relation parent);
> >  #endif   /* PARTITION_H */
> > diff --git a/src/include/catalog/pg_partitioned_table.h b/src/include/catalog/pg_partitioned_table.h
> > index bdff36a..69e509c 100644
> > --- a/src/include/catalog/pg_partitioned_table.h
> > +++ b/src/include/catalog/pg_partitioned_table.h
> > @@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> >      char        partstrat;        /* partitioning strategy */
> >      int16        partnatts;        /* number of partition key columns */
> >  
> > +    int16        partnparts;
> > +    Oid            parthashfunc;
> > +
> >      /*
> >       * variable-length fields start here, but we allow direct access to
> >       * partattrs via the C struct.  That's because the first variable-length
> > @@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> >      pg_node_tree partexprs;        /* list of expressions in the partition key;
> >                                   * one item for each zero entry in partattrs[] */
> >  #endif
> > +
> > +
> >  } FormData_pg_partitioned_table;
> >  
> >  /* ----------------
> > @@ -62,13 +67,15 @@ typedef FormData_pg_partitioned_table *Form_pg_partitioned_table;
> >   *        compiler constants for pg_partitioned_table
> >   * ----------------
> >   */
> > -#define Natts_pg_partitioned_table                7
> > +#define Natts_pg_partitioned_table                9
> >  #define Anum_pg_partitioned_table_partrelid        1
> >  #define Anum_pg_partitioned_table_partstrat        2
> >  #define Anum_pg_partitioned_table_partnatts        3
> > -#define Anum_pg_partitioned_table_partattrs        4
> > -#define Anum_pg_partitioned_table_partclass        5
> > -#define Anum_pg_partitioned_table_partcollation 6
> > -#define Anum_pg_partitioned_table_partexprs        7
> > +#define Anum_pg_partitioned_table_partnparts    4
> > +#define Anum_pg_partitioned_table_parthashfunc    5
> > +#define Anum_pg_partitioned_table_partattrs        6
> > +#define Anum_pg_partitioned_table_partclass        7
> > +#define Anum_pg_partitioned_table_partcollation 8
> > +#define Anum_pg_partitioned_table_partexprs        9
> >  
> >  #endif   /* PG_PARTITIONED_TABLE_H */
> > diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
> > index 5afc3eb..1c3474f 100644
> > --- a/src/include/nodes/parsenodes.h
> > +++ b/src/include/nodes/parsenodes.h
> > @@ -730,11 +730,14 @@ typedef struct PartitionSpec
> >      NodeTag        type;
> >      char       *strategy;        /* partitioning strategy ('list' or 'range') */
> >      List       *partParams;        /* List of PartitionElems */
> > +    int            partnparts;
> > +    List       *hashfunc;
> >      int            location;        /* token location, or -1 if unknown */
> >  } PartitionSpec;
> >  
> >  #define PARTITION_STRATEGY_LIST        'l'
> >  #define PARTITION_STRATEGY_RANGE    'r'
> > +#define PARTITION_STRATEGY_HASH        'h'
> >  
> >  /*
> >   * PartitionBoundSpec - a partition bound specification
> > diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
> > index 985d650..0597939 100644
> > --- a/src/include/parser/kwlist.h
> > +++ b/src/include/parser/kwlist.h
> > @@ -180,6 +180,7 @@ PG_KEYWORD("greatest", GREATEST, COL_NAME_KEYWORD)
> >  PG_KEYWORD("group", GROUP_P, RESERVED_KEYWORD)
> >  PG_KEYWORD("grouping", GROUPING, COL_NAME_KEYWORD)
> >  PG_KEYWORD("handler", HANDLER, UNRESERVED_KEYWORD)
> > +PG_KEYWORD("hash", HASH, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("having", HAVING, RESERVED_KEYWORD)
> >  PG_KEYWORD("header", HEADER_P, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
> > @@ -291,6 +292,7 @@ PG_KEYWORD("parallel", PARALLEL, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("parser", PARSER, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("partial", PARTIAL, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD)
> > +PG_KEYWORD("partitions", PARTITIONS, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("passing", PASSING, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("password", PASSWORD, UNRESERVED_KEYWORD)
> >  PG_KEYWORD("placing", PLACING, RESERVED_KEYWORD)
> > diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> > index a617a7c..660adfb 100644
> > --- a/src/include/utils/rel.h
> > +++ b/src/include/utils/rel.h
> > @@ -62,6 +62,9 @@ typedef struct PartitionKeyData
> >      Oid           *partopcintype;    /* OIDs of opclass declared input data types */
> >      FmgrInfo   *partsupfunc;    /* lookup info for support funcs */
> >  
> > +    int16        partnparts;        /* number of hash partitions */
> > +    Oid            parthashfunc;    /* OID of hash function */
> > +
> >      /* Partitioning collation per attribute */
> >      Oid           *partcollation;
> >  
> 
> > 
> > -- 
> > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-hackers
> 
> 
> -- 
> Best regards,
> Aleksander Alekseev


-- 
Yugo Nagata <nagata@sraoss.co.jp>
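
For readers following the quoted patch, the routing rule stated in this thread,
abs(hashfunc(key)) % <number_of_partitions>, can be sketched as standalone C.
This is only an illustration, not code from the patch: the name
hash_partition_index is hypothetical, and the handling of INT32_MIN (negating
in unsigned arithmetic) is an assumption of this sketch. The quoted
get_partition_for_tuple hunk reads the hash value through DatumGetInt16,
whereas this sketch keeps all 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: map a 32-bit hash value to a
 * partition index in the range [0, nparts).
 */
int
hash_partition_index(int32_t hashval, int nparts)
{
    uint32_t    magnitude;

    assert(nparts > 0);

    /*
     * Take the absolute value in unsigned arithmetic so that INT32_MIN,
     * whose magnitude does not fit in int32_t, does not overflow.
     */
    magnitude = (hashval < 0) ? (0u - (uint32_t) hashval) : (uint32_t) hashval;

    return (int) (magnitude % (uint32_t) nparts);
}
```

For example, with three partitions a hash value of 7 and a hash value of -7
both map to index 1, so the sign of the hash does not affect the placement.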

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

01 March 2017, 12:45:38

Hi Amit,

On Wed, 1 Mar 2017 11:14:15 +0900
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

> Nagata-san,
> 
> On 2017/02/28 23:33, Yugo Nagata wrote:
> > Hi all,
> > 
> > Now we have a declarative partitioning, but hash partitioning is not
> > implemented yet. Attached is a POC patch to add the hash partitioning
> > feature. I know we will need more discussions about the syntax and other
> > specifications before going ahead the project, but I think this runnable
> > code might help to discuss what and how we implement this.
> 
> Great!

Thank you!

> 
> > * Description
> > 
> > In this patch, the hash partitioning implementation is basically based
> > on the list partitioning mechanism. However, partition bounds cannot be
> > specified explicitly, but this is used internally as hash partition
> > index, which is calculated when a partition is created or attached.
> > 
> > The tentative syntax to create a partitioned table is as below:
> > 
> >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> > 
> > The number of partitions is specified by PARTITIONS, which is currently
> > constant and cannot be changed, but I think this needs to be changed in
> > some manner. A hash function is specified by USING. Perhaps specifying the
> > hash function could be omitted, in which case a default hash function
> > corresponding to the key type would be used.
> > 
> > A partition table can be created as below:
> > 
> >  CREATE TABLE h1 PARTITION OF h;
> >  CREATE TABLE h2 PARTITION OF h;
> >  CREATE TABLE h3 PARTITION OF h;
> > 
> > The FOR VALUES clause cannot be used, and the partition bound is
> > calculated automatically as a partition index, a single integer value.
> > 
> > Trying to create more partitions than the number specified
> > by PARTITIONS raises an error.
> > 
> > postgres=# create table h4 partition of h;
> > ERROR:  cannot create hash partition more than 3 for h
> 
> Instead of having to create each partition individually, wouldn't it be
> better if the following command
> 
> CREATE TABLE h (i int) PARTITION BY HASH (i) PARTITIONS 3;
> 
> created the partitions *automatically*?
> 
> It makes sense to provide a way to create individual list and range
> partitions separately, because users can specify custom bounds for each.
> We don't need that for hash partitions, so why make users run separate
> commands (without the FOR VALUES clause) anyway?  We may perhaps need to
> offer a way to optionally specify a user-defined name for each partition
> in the same command, along with tablespace, storage options, etc.  By
> default, the names would be generated internally and the user can ALTER
> individual partitions after the fact to specify tablespace, etc.

I thought that creating each partition individually was needed because some
users will want to specify a tablespace for each partition. However, as you
say, that isn't needed in many cases because users can move a partition
to another tablespace afterward with ALTER.

Thanks,
Yugo Nagata

> 
> Thanks,
> Amit
> 


-- 
Yugo Nagata <nagata@sraoss.co.jp>
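
The index assignment discussed in this message (each new partition receives
the smallest unused index, and no more than PARTITIONS indexes exist) can be
sketched as standalone C. This is only an illustration modeled on
get_next_hash_partition_index from the quoted patch: the name
next_free_partition_index and its array-based interface are hypothetical, and
it uses calloc where backend code would use palloc0.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical standalone version of the "smallest unused index" search.
 *
 * used_idx holds the indexes already assigned to existing partitions and
 * nparts is the PARTITIONS value.  Returns the smallest free index, nparts
 * when every index is taken, or -1 on invalid input or allocation failure.
 */
int
next_free_partition_index(const int *used_idx, int nused, int nparts)
{
    bool       *used;
    int         i;

    if (nparts <= 0)
        return -1;

    used = calloc((size_t) nparts, sizeof(bool));
    if (used == NULL)
        return -1;

    /* Mark the indexes taken by existing partitions, ignoring bad values. */
    for (i = 0; i < nused; i++)
        if (used_idx[i] >= 0 && used_idx[i] < nparts)
            used[used_idx[i]] = true;

    /* Find the smallest index that is still free. */
    for (i = 0; i < nparts; i++)
        if (!used[i])
            break;

    free(used);
    return i;
}
```

With indexes 0 and 2 taken and PARTITIONS 3, the function returns 1. A caller
would treat a return value equal to nparts as "no slot left", which
corresponds to the "cannot create hash partition more than 3" error shown
above.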

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

01 March 2017, 13:07:30

On Wed, 1 Mar 2017 10:30:09 +0530
Rushabh Lathia <rushabh.lathia@gmail.com> wrote:

> On Tue, Feb 28, 2017 at 8:03 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> 
> > Hi all,
> >
> > Now we have a declarative partitioning, but hash partitioning is not
> > implemented yet. Attached is a POC patch to add the hash partitioning
> > feature. I know we will need more discussions about the syntax and other
> > specifications before going ahead the project, but I think this runnable
> > code might help to discuss what and how we implement this.
> >
> > * Description
> >
> > In this patch, the hash partitioning implementation is basically based
> > on the list partitioning mechanism. However, partition bounds cannot be
> > specified explicitly; the bound is used internally as the hash partition
> > index, which is calculated when a partition is created or attached.
> >
> > The tentative syntax to create a partitioned table is as below:
> >
> >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> >
> > The number of partitions is specified by PARTITIONS, which is currently
> > constant and cannot be changed, but I think this needs to be changeable in
> > some manner. A hash function is specified by USING. Perhaps the hash
> > function could be omitted, in which case a default hash function
> > corresponding to the key type would be used.
> >
> > A partition can be created as below:
> >
> >  CREATE TABLE h1 PARTITION OF h;
> >  CREATE TABLE h2 PARTITION OF h;
> >  CREATE TABLE h3 PARTITION OF h;
> >
> > The FOR VALUES clause cannot be used, and the partition bound is
> > calculated automatically as a partition index, a single integer value.
> >
> > Trying to create more partitions than the number specified
> > by PARTITIONS results in an error.
> >
> > postgres=# create table h4 partition of h;
> > ERROR:  cannot create hash partition more than 3 for h
> >
> > An inserted record is stored in a partition whose index equals
> > abs(hashfunc(key)) % <number_of_partitions>. In the above
> > example, this is abs(hashint4(i))%3.
> >
> > postgres=# insert into h (select generate_series(0,20));
> > INSERT 0 21
> >
> > postgres=# select *,tableoid::regclass from h;
> >  i  | tableoid
> > ----+----------
> >   0 | h1
> >   1 | h1
> >   2 | h1
> >   4 | h1
> >   8 | h1
> >  10 | h1
> >  11 | h1
> >  14 | h1
> >  15 | h1
> >  17 | h1
> >  20 | h1
> >   5 | h2
> >  12 | h2
> >  13 | h2
> >  16 | h2
> >  19 | h2
> >   3 | h3
> >   6 | h3
> >   7 | h3
> >   9 | h3
> >  18 | h3
> > (21 rows)
> >
> >
> This is good. I will have a closer look at the patch, but here are
> a few quick comments.

Thanks. I'm looking forward to your comments.

> 
> - The CREATE HASH partition syntax adds two new keywords, and ideally
> we should try to avoid adding additional keywords. Also, I can see that
> the HASH keyword has been added, but I don't see any use of the newly added
> keyword in gram.y.

Yes, you are right. The HASH keyword is not necessary. I'll remove it
from the patch.

> 
> - Also, I didn't like the idea of fixing the number of partitions in the
> CREATE TABLE syntax. That's something that needs to be changeable.

I agree. The number specified by PARTITIONS should be the *initial* number
of partitions, and it should be possible to change it. I'm investigating
how to do that.

> 
> 
> 
> > * Todo / discussions
> >
> > In this patch, we cannot change the number of partitions specified
> > by PARTITIONS. If we could change this, the partitioning rule
> > (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> > would also change, and we would then need to reallocate records between
> > partitions.
> >
> > In this patch, the user can specify a hash function with USING. However,
> > we might need default hash functions which are useful and
> > proper for hash partitioning.
> >
> 
> +1
> 
> - With a fixed default hash function and no number of partitions specified
> during CREATE TABLE, we don't need the two additional columns in the
> pg_partitioned_table catalog.

I think the option to specify a hash function is needed because
users may want to use a user-defined hash function for some reason,
for example, when a user-defined type is used as a partition key.

> 
> 
> > Currently, even when we issue a SELECT query with a condition,
> > postgres looks into all partitions regardless of each partition's
> > constraint, because the constraint is a complicated expression such as
> > "abs(hashint4(i))%3 = 0".
> >
> > postgres=# explain select * from h where i = 10;
> >                         QUERY PLAN
> > ----------------------------------------------------------
> >  Append  (cost=0.00..125.62 rows=40 width=4)
> >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> >    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
> >          Filter: (i = 10)
> > (9 rows)
> >
> > However, if we rewrite the condition into the same expression
> > as the partition's constraint, postgres can exclude unrelated
> > tables from the search targets. So, we might avoid the problem
> > by converting the qual properly before calling predicate_refuted_by().
> >
> > postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
> >                         QUERY PLAN
> > ----------------------------------------------------------
> >  Append  (cost=0.00..61.00 rows=14 width=4)
> >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> >          Filter: ((abs(hashint4(i)) % 3) = 2)
> >    ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
> >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > (5 rows)
> >
> > Best regards,
> > Yugo Nagata
> >
> > --
> > Yugo Nagata <nagata@sraoss.co.jp>
> >
> >
> >
> 
> 
> Regards,
> 
> Rushabh Lathia


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

01 March 2017, 13:20:25

On Wed, 1 Mar 2017 10:52:58 +0530
amul sul <sulamul@gmail.com> wrote:

> On Tue, Feb 28, 2017 at 8:03 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > Hi all,
> >
> > Now we have declarative partitioning, but hash partitioning is not
> > implemented yet. Attached is a POC patch to add the hash partitioning
> > feature. I know we will need more discussion about the syntax and other
> > specifications before going ahead with the project, but I think this runnable
> > code might help us discuss what to implement and how.
> >
>
> Great.

Thanks.

>
> > * Description
> >
> > In this patch, the hash partitioning implementation is basically based
> > on the list partitioning mechanism. However, partition bounds cannot be
> > specified explicitly; the bound is used internally as the hash partition
> > index, which is calculated when a partition is created or attached.
> >
> > The tentative syntax to create a partitioned table is as below:
> >
> >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> >
> > The number of partitions is specified by PARTITIONS, which is currently
> > constant and cannot be changed, but I think this needs to be changeable in
> > some manner. A hash function is specified by USING. Perhaps the hash
> > function could be omitted, in which case a default hash function
> > corresponding to the key type would be used.
> >
> > A partition can be created as below:
> >
> >  CREATE TABLE h1 PARTITION OF h;
> >  CREATE TABLE h2 PARTITION OF h;
> >  CREATE TABLE h3 PARTITION OF h;
> >
> > The FOR VALUES clause cannot be used, and the partition bound is
> > calculated automatically as a partition index, a single integer value.
> >
> > Trying to create more partitions than the number specified
> > by PARTITIONS results in an error.
> >
> > postgres=# create table h4 partition of h;
> > ERROR:  cannot create hash partition more than 3 for h
> >
> > An inserted record is stored in a partition whose index equals
> > abs(hashfunc(key)) % <number_of_partitions>. In the above
> > example, this is abs(hashint4(i))%3.
> >
> > postgres=# insert into h (select generate_series(0,20));
> > INSERT 0 21
> >
> > postgres=# select *,tableoid::regclass from h;
> >  i  | tableoid
> > ----+----------
> >   0 | h1
> >   1 | h1
> >   2 | h1
> >   4 | h1
> >   8 | h1
> >  10 | h1
> >  11 | h1
> >  14 | h1
> >  15 | h1
> >  17 | h1
> >  20 | h1
> >   5 | h2
> >  12 | h2
> >  13 | h2
> >  16 | h2
> >  19 | h2
> >   3 | h3
> >   6 | h3
> >   7 | h3
> >   9 | h3
> >  18 | h3
> > (21 rows)
> >
> > * Todo / discussions
> >
> > In this patch, we cannot change the number of partitions specified
> > by PARTITIONS. If we could change this, the partitioning rule
> > (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> > would also change, and we would then need to reallocate records between
> > partitions.
> >
> > In this patch, the user can specify a hash function with USING. However,
> > we might need default hash functions which are useful and
> > proper for hash partitioning.
> >
> IMHO, we should try to keep the create partition syntax simple and aligned
> with the other partition strategies. E.g.:
> CREATE TABLE h (i int) PARTITION BY HASH(i);
>
> I agree that the number of partitions is unavoidable in modulo hashing,
> but we can avoid it with other hashing techniques.  Have you thought about
> linear hashing[1] or consistent hashing[2]?  This would allow us to
> add/drop partitions with minimal row movement.

Thank you for the information on hashing techniques. I'll look into them
and try to allow the number of partitions to be changed.

Thanks,
Yugo Nagata

>
> +1 for the default hash function corresponding to partitioning key type.
>
> Regards,
> Amul
> 
>
> [1] https://en.wikipedia.org/wiki/Linear_hashing
> [2] https://en.wikipedia.org/wiki/Consistent_hashing


--
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

Aleksander Alekseev

Date:

01 March 2017, 17:08:49

Hi, Yugo.

Today I've had an opportunity to take a closer look at this patch. Here are
a few things that bother me.

1a) There are missing comments here:

```
--- a/src/include/catalog/pg_partitioned_table.h
+++ b/src/include/catalog/pg_partitioned_table.h
@@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
    char        partstrat;      /* partitioning strategy */
    int16       partnatts;      /* number of partition key columns */

+   int16       partnparts;
+   Oid         parthashfunc;
+
```

1b) ... and here:

```
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -730,11 +730,14 @@ typedef struct PartitionSpec
    NodeTag     type;
    char       *strategy;       /* partitioning strategy ('list' or 'range') */
    List       *partParams;     /* List of PartitionElems */
+   int         partnparts;
+   List       *hashfunc;
    int         location;       /* token location, or -1 if unknown */
 } PartitionSpec;
```

2) I believe new empty lines in patches are generally not welcomed by
the community:

```
@@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
    pg_node_tree partexprs;     /* list of expressions in the partition key;
                                 * one item for each zero entry in partattrs[] */
 #endif
+
+} FormData_pg_partitioned_table;
```

3) One test fails on my laptop (Arch Linux, x64) [1]:

```
***************
*** 344,350 ****
  CREATE TABLE partitioned (
      a int
  ) PARTITION BY HASH (a);
! ERROR:  unrecognized partitioning strategy "hash"
  -- specified column must be present in the table
  CREATE TABLE partitioned (
      a int
--- 344,350 ----
  CREATE TABLE partitioned (
      a int
  ) PARTITION BY HASH (a);
! ERROR:  number of partitions must be specified for hash partition
  -- specified column must be present in the table
  CREATE TABLE partitioned (
      a int
```

The exact script I'm using for building and testing PostgreSQL can be
found here [2].

4) As I already mentioned - missing documentation.

In general, the patch looks quite good to me. I personally believe it has
every chance of being accepted in the current commitfest, provided, naturally,
that the community comes to a consensus regarding keywords, whether all
partitions should be created automatically, etc. :)

[1] http://afiskon.ru/s/dd/20cbe21934_regression.diffs.txt
[2] http://afiskon.ru/s/76/a4fb71739c_full-build.sh.txt

On Wed, Mar 01, 2017 at 06:10:10PM +0900, Yugo Nagata wrote:
> Hi Aleksander,
>
> On Tue, 28 Feb 2017 18:05:36 +0300
> Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> > Hi, Yugo.
> >
> > Looks like a great feature! I'm going to take a closer look at your code
> > and write up feedback shortly. For now I can only say that you forgot
> > to include some documentation in the patch.
>
> Thank you for looking into it. I'm looking forward to your feedback.
> This is a proof-of-concept patch, so additional documentation
> is not included. I'll add it after reaching a consensus
> on the specification of the feature.
>
> >
> > I've added a corresponding entry to the current commitfest [1]. Hope you
> > don't mind. If it's not too much trouble, could you please register on the
> > commitfest site and add yourself to this entry as an author? I'm pretty
> > sure someone is using this information for writing release notes or
> > something like that.
>
> Thank you for registering it in the commitfest. I have added myself as an author.
>
> >
> > [1] https://commitfest.postgresql.org/13/1059/
> >
> > On Tue, Feb 28, 2017 at 11:33:13PM +0900, Yugo Nagata wrote:
> > > Hi all,
> > >
> > > > Now we have declarative partitioning, but hash partitioning is not
> > > > implemented yet. Attached is a POC patch to add the hash partitioning
> > > > feature. I know we will need more discussion about the syntax and other
> > > > specifications before going ahead with the project, but I think this runnable
> > > > code might help us discuss what to implement and how.
> > >
> > > * Description
> > >
> > > > In this patch, the hash partitioning implementation is basically based
> > > > on the list partitioning mechanism. However, partition bounds cannot be
> > > > specified explicitly; the bound is used internally as the hash partition
> > > > index, which is calculated when a partition is created or attached.
> > > >
> > > > The tentative syntax to create a partitioned table is as below:
> > >
> > >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> > >
> > > > The number of partitions is specified by PARTITIONS, which is currently
> > > > constant and cannot be changed, but I think this needs to be changeable in
> > > > some manner. A hash function is specified by USING. Perhaps the hash
> > > > function could be omitted, in which case a default hash function
> > > > corresponding to the key type would be used.
> > > >
> > > > A partition can be created as below:
> > >
> > >  CREATE TABLE h1 PARTITION OF h;
> > >  CREATE TABLE h2 PARTITION OF h;
> > >  CREATE TABLE h3 PARTITION OF h;
> > >
> > > > The FOR VALUES clause cannot be used, and the partition bound is
> > > > calculated automatically as a partition index, a single integer value.
> > > >
> > > > Trying to create more partitions than the number specified
> > > > by PARTITIONS results in an error.
> > >
> > > postgres=# create table h4 partition of h;
> > > ERROR:  cannot create hash partition more than 3 for h
> > >
> > > An inserted record is stored in a partition whose index equals
> > > abs(hashfunc(key)) % <number_of_partitions>. In the above
> > > example, this is abs(hashint4(i))%3.
> > >
> > > postgres=# insert into h (select generate_series(0,20));
> > > INSERT 0 21
> > >
> > > postgres=# select *,tableoid::regclass from h;
> > >  i  | tableoid
> > > ----+----------
> > >   0 | h1
> > >   1 | h1
> > >   2 | h1
> > >   4 | h1
> > >   8 | h1
> > >  10 | h1
> > >  11 | h1
> > >  14 | h1
> > >  15 | h1
> > >  17 | h1
> > >  20 | h1
> > >   5 | h2
> > >  12 | h2
> > >  13 | h2
> > >  16 | h2
> > >  19 | h2
> > >   3 | h3
> > >   6 | h3
> > >   7 | h3
> > >   9 | h3
> > >  18 | h3
> > > (21 rows)
> > >
> > > * Todo / discussions
> > >
> > > > In this patch, we cannot change the number of partitions specified
> > > > by PARTITIONS. If we could change this, the partitioning rule
> > > > (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> > > > would also change, and we would then need to reallocate records between
> > > > partitions.
> > > >
> > > > In this patch, the user can specify a hash function with USING. However,
> > > > we might need default hash functions which are useful and
> > > > proper for hash partitioning.
> > > >
> > > > Currently, even when we issue a SELECT query with a condition,
> > > > postgres looks into all partitions regardless of each partition's
> > > > constraint, because the constraint is a complicated expression such as
> > > > "abs(hashint4(i))%3 = 0".
> > >
> > > postgres=# explain select * from h where i = 10;
> > >                         QUERY PLAN
> > > ----------------------------------------------------------
> > >  Append  (cost=0.00..125.62 rows=40 width=4)
> > >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> > >          Filter: (i = 10)
> > >    ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
> > >          Filter: (i = 10)
> > >    ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
> > >          Filter: (i = 10)
> > >    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
> > >          Filter: (i = 10)
> > > (9 rows)
> > >
> > > > However, if we rewrite the condition into the same expression
> > > > as the partition's constraint, postgres can exclude unrelated
> > > > tables from the search targets. So, we might avoid the problem
> > > > by converting the qual properly before calling predicate_refuted_by().
> > >
> > > postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
> > >                         QUERY PLAN
> > > ----------------------------------------------------------
> > >  Append  (cost=0.00..61.00 rows=14 width=4)
> > >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> > >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > >    ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
> > >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > > (5 rows)
> > >
> > > Best regards,
> > > Yugo Nagata
> > >
> > > --
> > > Yugo Nagata <nagata@sraoss.co.jp>
> >
> > > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> > > index 41c0056..3820920 100644
> > > --- a/src/backend/catalog/heap.c
> > > +++ b/src/backend/catalog/heap.c
> > > @@ -3074,7 +3074,7 @@ StorePartitionKey(Relation rel,
> > >                    AttrNumber *partattrs,
> > >                    List *partexprs,
> > >                    Oid *partopclass,
> > > -                  Oid *partcollation)
> > > +                  Oid *partcollation, int16 partnparts, Oid hashfunc)
> > >  {
> > >      int            i;
> > >      int2vector *partattrs_vec;
> > > @@ -3121,6 +3121,8 @@ StorePartitionKey(Relation rel,
> > >      values[Anum_pg_partitioned_table_partrelid - 1] = ObjectIdGetDatum(RelationGetRelid(rel));
> > >      values[Anum_pg_partitioned_table_partstrat - 1] = CharGetDatum(strategy);
> > >      values[Anum_pg_partitioned_table_partnatts - 1] = Int16GetDatum(partnatts);
> > > +    values[Anum_pg_partitioned_table_partnparts - 1] = Int16GetDatum(partnparts);
> > > +    values[Anum_pg_partitioned_table_parthashfunc - 1] = ObjectIdGetDatum(hashfunc);
> > >      values[Anum_pg_partitioned_table_partattrs - 1] = PointerGetDatum(partattrs_vec);
> > >      values[Anum_pg_partitioned_table_partclass - 1] = PointerGetDatum(partopclass_vec);
> > >      values[Anum_pg_partitioned_table_partcollation - 1] = PointerGetDatum(partcollation_vec);
> > > diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> > > index 4bcef58..24e69c6 100644
> > > --- a/src/backend/catalog/partition.c
> > > +++ b/src/backend/catalog/partition.c
> > > @@ -36,6 +36,8 @@
> > >  #include "optimizer/clauses.h"
> > >  #include "optimizer/planmain.h"
> > >  #include "optimizer/var.h"
> > > +#include "parser/parse_func.h"
> > > +#include "parser/parse_oper.h"
> > >  #include "rewrite/rewriteManip.h"
> > >  #include "storage/lmgr.h"
> > >  #include "utils/array.h"
> > > @@ -120,6 +122,7 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
> > >
> > >  static List *get_qual_for_list(PartitionKey key, PartitionBoundSpec *spec);
> > >  static List *get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec);
> > > +static List *get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec);
> > >  static Oid get_partition_operator(PartitionKey key, int col,
> > >                         StrategyNumber strategy, bool *need_relabel);
> > >  static List *generate_partition_qual(Relation rel);
> > > @@ -236,7 +239,8 @@ RelationBuildPartitionDesc(Relation rel)
> > >              oids[i++] = lfirst_oid(cell);
> > >
> > >          /* Convert from node to the internal representation */
> > > -        if (key->strategy == PARTITION_STRATEGY_LIST)
> > > +        if (key->strategy == PARTITION_STRATEGY_LIST ||
> > > +            key->strategy == PARTITION_STRATEGY_HASH)
> > >          {
> > >              List       *non_null_values = NIL;
> > >
> > > @@ -251,7 +255,7 @@ RelationBuildPartitionDesc(Relation rel)
> > >                  ListCell   *c;
> > >                  PartitionBoundSpec *spec = lfirst(cell);
> > >
> > > -                if (spec->strategy != PARTITION_STRATEGY_LIST)
> > > +                if (spec->strategy != key->strategy)
> > >                      elog(ERROR, "invalid strategy in partition bound spec");
> > >
> > >                  foreach(c, spec->listdatums)
> > > @@ -464,6 +468,7 @@ RelationBuildPartitionDesc(Relation rel)
> > >          switch (key->strategy)
> > >          {
> > >              case PARTITION_STRATEGY_LIST:
> > > +            case PARTITION_STRATEGY_HASH:
> > >                  {
> > >                      boundinfo->has_null = found_null;
> > >                      boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
> > > @@ -829,6 +834,18 @@ check_new_partition_bound(char *relname, Relation parent, Node *bound)
> > >                  break;
> > >              }
> > >
> > > +        case PARTITION_STRATEGY_HASH:
> > > +            {
> > > +                Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> > > +
> > > +                if (partdesc->nparts + 1 > key->partnparts)
> > > +                    ereport(ERROR,
> > > +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > > +                    errmsg("cannot create hash partition more than %d for %s",
> > > +                            key->partnparts, RelationGetRelationName(parent))));
> > > +                break;
> > > +            }
> > > +
> > >          default:
> > >              elog(ERROR, "unexpected partition strategy: %d",
> > >                   (int) key->strategy);
> > > @@ -916,6 +933,11 @@ get_qual_from_partbound(Relation rel, Relation parent, Node *bound)
> > >              my_qual = get_qual_for_range(key, spec);
> > >              break;
> > >
> > > +        case PARTITION_STRATEGY_HASH:
> > > +            Assert(spec->strategy == PARTITION_STRATEGY_LIST);
> > > +            my_qual = get_qual_for_hash(key, spec);
> > > +            break;
> > > +
> > >          default:
> > >              elog(ERROR, "unexpected partition strategy: %d",
> > >                   (int) key->strategy);
> > > @@ -1146,6 +1168,84 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
> > >      return pd;
> > >  }
> > >
> > > +/*
> > > + * convert_expr_for_hash
> > > + *
> > > + * Converts a expr for a hash partition's constraint.
> > > + * expr is converted into 'abs(hashfunc(expr)) % npart".
> > > + *
> > > + * npart: number of partitions
> > > + * hashfunc: OID of hash function
> > > + */
> > > +Expr *
> > > +convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc)
> > > +{
> > > +    FuncExpr   *func,
> > > +               *abs;
> > > +    Expr        *modexpr;
> > > +    Oid            modoid;
> > > +    Oid            int4oid[1] = {INT4OID};
> > > +
> > > +    ParseState *pstate = make_parsestate(NULL);
> > > +    Value       *val_npart = makeInteger(npart);
> > > +    Node       *const_npart = (Node *) make_const(pstate, val_npart, -1);
> > > +
> > > +    /* hash function */
> > > +    func = makeFuncExpr(hashfunc,
> > > +                        INT4OID,
> > > +                        list_make1(expr),
> > > +                        0,
> > > +                        0,
> > > +                        COERCE_EXPLICIT_CALL);
> > > +
> > > +    /* Abs */
> > > +    abs = makeFuncExpr(LookupFuncName(list_make1(makeString("abs")), 1, int4oid, false),
> > > +                       INT4OID,
> > > +                       list_make1(func),
> > > +                       0,
> > > +                       0,
> > > +                       COERCE_EXPLICIT_CALL);
> > > +
> > > +    /* modulo by npart */
> > > +    modoid = LookupOperName(pstate, list_make1(makeString("%")), INT4OID, INT4OID, false, -1);
> > > +    modexpr = make_opclause(modoid, INT4OID, false, (Expr*)abs, (Expr*)const_npart, 0, 0);
> > > +
> > > +    return modexpr;
> > > +}
> > > +
> > > +
> > > +/*
> > > + * get_next_hash_partition_index
> > > + *
> > > + * Returns the minimal index which is not used for hash partition.
> > > + */
> > > +int
> > > +get_next_hash_partition_index(Relation parent)
> > > +{
> > > +    PartitionKey key = RelationGetPartitionKey(parent);
> > > +    PartitionDesc partdesc = RelationGetPartitionDesc(parent);
> > > +
> > > +    int      i;
> > > +    bool *used = palloc0(sizeof(int) * key->partnparts);
> > > +
> > > +    /* mark used for existing partition indexs */
> > > +    for (i = 0; i < partdesc->boundinfo->ndatums; i++)
> > > +    {
> > > +        Datum* datum = partdesc->boundinfo->datums[i];
> > > +        int idx = DatumGetInt16(datum[0]);
> > > +
> > > +        if (!used[idx])
> > > +            used[idx] = true;
> > > +    }
> > > +
> > > +    /* find the minimal unused index */
> > > +    for (i = 0; i < key->partnparts; i++)
> > > +        if (!used[i])
> > > +            break;
> > > +
> > > +    return i;
> > > +}
> > > +
> > >  /* Module-local functions */
> > >
> > >  /*
> > > @@ -1467,6 +1567,43 @@ get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec)
> > >  }
> > >
> > >  /*
> > > + * get_qual_for_hash
> > > + *
> > > + * Returns a list of expressions to use as a hash partition's constraint.
> > > + */
> > > +static List *
> > > +get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
> > > +{
> > > +    List       *result;
> > > +    Expr       *keyCol;
> > > +    Expr       *expr;
> > > +    Expr        *opexpr;
> > > +    Oid            operoid;
> > > +    ParseState *pstate = make_parsestate(NULL);
> > > +
> > > +    /* Left operand */
> > > +    if (key->partattrs[0] != 0)
> > > +        keyCol = (Expr *) makeVar(1,
> > > +                                  key->partattrs[0],
> > > +                                  key->parttypid[0],
> > > +                                  key->parttypmod[0],
> > > +                                  key->parttypcoll[0],
> > > +                                  0);
> > > +    else
> > > +        keyCol = (Expr *) copyObject(linitial(key->partexprs));
> > > +
> > > +    expr = convert_expr_for_hash(keyCol, key->partnparts, key->parthashfunc);
> > > +
> > > +    /* equals the listdaums value */
> > > +    operoid = LookupOperName(pstate, list_make1(makeString("=")), INT4OID, INT4OID, false, -1);
> > > +    opexpr = make_opclause(operoid, BOOLOID, false, expr, linitial(spec->listdatums), 0, 0);
> > > +
> > > +    result = list_make1(opexpr);
> > > +
> > > +    return result;
> > > +}
> > > +
> > > +/*
> > >   * get_partition_operator
> > >   *
> > >   * Return oid of the operator of given strategy for a given partition key
> > > @@ -1730,6 +1867,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
> > >                              (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
> > >                          errmsg("range partition key of row contains null")));
> > >          }
> > > +        else if (key->strategy == PARTITION_STRATEGY_HASH)
> > > +        {
> > > +            values[0] = OidFunctionCall1(key->parthashfunc, values[0]);
> > > +            values[0] = Int16GetDatum(Abs(DatumGetInt16(values[0])) % key->partnparts);
> > > +        }
> > >
> > >          if (partdesc->boundinfo->has_null && isnull[0])
> > >              /* Tuple maps to the null-accepting list partition */
> > > @@ -1744,6 +1886,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
> > >              switch (key->strategy)
> > >              {
> > >                  case PARTITION_STRATEGY_LIST:
> > > +                case PARTITION_STRATEGY_HASH:
> > >                      if (cur_offset >= 0 && equal)
> > >                          cur_index = partdesc->boundinfo->indexes[cur_offset];
> > >                      else
> > > @@ -1968,6 +2111,7 @@ partition_bound_cmp(PartitionKey key, PartitionBoundInfo boundinfo,
> > >      switch (key->strategy)
> > >      {
> > >          case PARTITION_STRATEGY_LIST:
> > > +        case PARTITION_STRATEGY_HASH:
> > >              cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
> > >                                                       key->partcollation[0],
> > >                                                       bound_datums[0],
> > > diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> > > index 3cea220..5a28cc0 100644
> > > --- a/src/backend/commands/tablecmds.c
> > > +++ b/src/backend/commands/tablecmds.c
> > > @@ -41,6 +41,7 @@
> > >  #include "catalog/pg_inherits_fn.h"
> > >  #include "catalog/pg_namespace.h"
> > >  #include "catalog/pg_opclass.h"
> > > +#include "catalog/pg_proc.h"
> > >  #include "catalog/pg_tablespace.h"
> > >  #include "catalog/pg_trigger.h"
> > >  #include "catalog/pg_type.h"
> > > @@ -77,6 +78,7 @@
> > >  #include "parser/parse_oper.h"
> > >  #include "parser/parse_relation.h"
> > >  #include "parser/parse_type.h"
> > > +#include "parser/parse_func.h"
> > >  #include "parser/parse_utilcmd.h"
> > >  #include "parser/parser.h"
> > >  #include "pgstat.h"
> > > @@ -450,7 +452,7 @@ static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
> > >                                   Oid oldrelid, void *arg);
> > >  static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
> > >  static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
> > > -static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > > +static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> > >                        List **partexprs, Oid *partopclass, Oid *partcollation);
> > >  static void CreateInheritance(Relation child_rel, Relation parent_rel);
> > >  static void RemoveInheritance(Relation child_rel, Relation parent_rel);
> > > @@ -799,8 +801,10 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> > >          AttrNumber    partattrs[PARTITION_MAX_KEYS];
> > >          Oid            partopclass[PARTITION_MAX_KEYS];
> > >          Oid            partcollation[PARTITION_MAX_KEYS];
> > > +        Oid            partatttypes[PARTITION_MAX_KEYS];
> > >          List       *partexprs = NIL;
> > >          List       *cmds = NIL;
> > > +        Oid hashfuncOid = InvalidOid;
> > >
> > >          /*
> > >           * We need to transform the raw parsetrees corresponding to partition
> > > @@ -811,15 +815,40 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> > >          stmt->partspec = transformPartitionSpec(rel, stmt->partspec,
> > >                                                  &strategy);
> > >          ComputePartitionAttrs(rel, stmt->partspec->partParams,
> > > -                              partattrs, &partexprs, partopclass,
> > > +                              partattrs, partatttypes, &partexprs, partopclass,
> > >                                partcollation);
> > >
> > >          partnatts = list_length(stmt->partspec->partParams);
> > > +
> > > +        if (strategy == PARTITION_STRATEGY_HASH)
> > > +        {
> > > +            Oid funcrettype;
> > > +
> > > +            if (partnatts != 1)
> > > +                ereport(ERROR,
> > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                        errmsg("number of partition key must be 1 for hash partition")));
> > > +
> > > +            hashfuncOid = LookupFuncName(stmt->partspec->hashfunc, 1, partatttypes, false);
> > > +            funcrettype = get_func_rettype(hashfuncOid);
> > > +            if (funcrettype != INT4OID)
> > > +                ereport(ERROR,
> > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                        errmsg("hash function for partitioning must return integer")));
> > > +
> > > +            if (func_volatile(hashfuncOid) != PROVOLATILE_IMMUTABLE)
> > > +                ereport(ERROR,
> > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                        errmsg("hash function for partitioning must be marked IMMUTABLE")));
> > > +
> > > +        }
> > > +
> > >          StorePartitionKey(rel, strategy, partnatts, partattrs, partexprs,
> > > -                          partopclass, partcollation);
> > > +                          partopclass, partcollation, stmt->partspec->partnparts, hashfuncOid);
> > >
> > > -        /* Force key columns to be NOT NULL when using range partitioning */
> > > -        if (strategy == PARTITION_STRATEGY_RANGE)
> > > +        /* Force key columns to be NOT NULL when using range or hash partitioning */
> > > +        if (strategy == PARTITION_STRATEGY_RANGE ||
> > > +            strategy == PARTITION_STRATEGY_HASH)
> > >          {
> > >              for (i = 0; i < partnatts; i++)
> > >              {
> > > @@ -12783,18 +12812,51 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> > >      newspec->strategy = partspec->strategy;
> > >      newspec->location = partspec->location;
> > >      newspec->partParams = NIL;
> > > +    newspec->partnparts = partspec->partnparts;
> > > +    newspec->hashfunc = partspec->hashfunc;
> > >
> > >      /* Parse partitioning strategy name */
> > >      if (!pg_strcasecmp(partspec->strategy, "list"))
> > >          *strategy = PARTITION_STRATEGY_LIST;
> > >      else if (!pg_strcasecmp(partspec->strategy, "range"))
> > >          *strategy = PARTITION_STRATEGY_RANGE;
> > > +    else if (!pg_strcasecmp(partspec->strategy, "hash"))
> > > +        *strategy = PARTITION_STRATEGY_HASH;
> > >      else
> > >          ereport(ERROR,
> > >                  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > >                   errmsg("unrecognized partitioning strategy \"%s\"",
> > >                          partspec->strategy)));
> > >
> > > +    if (*strategy == PARTITION_STRATEGY_HASH)
> > > +    {
> > > +        if (partspec->partnparts < 0)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                     errmsg("number of partitions must be specified for hash partition")));
> > > +        else if (partspec->partnparts == 0)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                     errmsg("number of partitions must be greater than 0")));
> > > +
> > > +        if (list_length(partspec->hashfunc) == 0)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                     errmsg("hash function must be specified for hash partition")));
> > > +    }
> > > +    else
> > > +    {
> > > +        if (partspec->partnparts >= 0)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                     errmsg("number of partitions can be specified only for hash partition")));
> > > +
> > > +        if (list_length(partspec->hashfunc) > 0)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +                     errmsg("hash function can be specified only for hash partition")));
> > > +    }
> > > +
> > >      /*
> > >       * Create a dummy ParseState and insert the target relation as its sole
> > >       * rangetable entry.  We need a ParseState for transformExpr.
> > > @@ -12843,7 +12905,7 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> > >   * Compute per-partition-column information from a list of PartitionElem's
> > >   */
> > >  static void
> > > -ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > > +ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> > >                        List **partexprs, Oid *partopclass, Oid *partcollation)
> > >  {
> > >      int            attn;
> > > @@ -13010,6 +13072,7 @@ ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > >                                                 "btree",
> > >                                                 BTREE_AM_OID);
> > >
> > > +        partatttypes[attn] = atttype;
> > >          attn++;
> > >      }
> > >  }
> > > diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> > > index 05d8538..f4febc9 100644
> > > --- a/src/backend/nodes/copyfuncs.c
> > > +++ b/src/backend/nodes/copyfuncs.c
> > > @@ -4232,6 +4232,8 @@ _copyPartitionSpec(const PartitionSpec *from)
> > >
> > >      COPY_STRING_FIELD(strategy);
> > >      COPY_NODE_FIELD(partParams);
> > > +    COPY_SCALAR_FIELD(partnparts);
> > > +    COPY_NODE_FIELD(hashfunc);
> > >      COPY_LOCATION_FIELD(location);
> > >
> > >      return newnode;
> > > diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
> > > index d595cd7..d589eac 100644
> > > --- a/src/backend/nodes/equalfuncs.c
> > > +++ b/src/backend/nodes/equalfuncs.c
> > > @@ -2725,6 +2725,8 @@ _equalPartitionSpec(const PartitionSpec *a, const PartitionSpec *b)
> > >  {
> > >      COMPARE_STRING_FIELD(strategy);
> > >      COMPARE_NODE_FIELD(partParams);
> > > +    COMPARE_SCALAR_FIELD(partnparts);
> > > +    COMPARE_NODE_FIELD(hashfunc);
> > >      COMPARE_LOCATION_FIELD(location);
> > >
> > >      return true;
> > > diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> > > index b3802b4..d6db80e 100644
> > > --- a/src/backend/nodes/outfuncs.c
> > > +++ b/src/backend/nodes/outfuncs.c
> > > @@ -3318,6 +3318,8 @@ _outPartitionSpec(StringInfo str, const PartitionSpec *node)
> > >
> > >      WRITE_STRING_FIELD(strategy);
> > >      WRITE_NODE_FIELD(partParams);
> > > +    WRITE_INT_FIELD(partnparts);
> > > +    WRITE_NODE_FIELD(hashfunc);
> > >      WRITE_LOCATION_FIELD(location);
> > >  }
> > >
> > > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> > > index e833b2e..b67140d 100644
> > > --- a/src/backend/parser/gram.y
> > > +++ b/src/backend/parser/gram.y
> > > @@ -574,6 +574,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > >  %type <list>        partbound_datum_list
> > >  %type <partrange_datum>    PartitionRangeDatum
> > >  %type <list>        range_datum_list
> > > +%type <ival>        hash_partitions
> > > +%type <list>        hash_function
> > >
> > >  /*
> > >   * Non-keyword token types.  These are hard-wired into the "flex" lexer.
> > > @@ -627,7 +629,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > >
> > >      GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING
> > >
> > > -    HANDLER HAVING HEADER_P HOLD HOUR_P
> > > +    HANDLER HASH HAVING HEADER_P HOLD HOUR_P
> > >
> > >      IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
> > >      INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
> > > @@ -651,7 +653,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > >      OBJECT_P OF OFF OFFSET OIDS OLD ON ONLY OPERATOR OPTION OPTIONS OR
> > >      ORDER ORDINALITY OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER
> > >
> > > -    PARALLEL PARSER PARTIAL PARTITION PASSING PASSWORD PLACING PLANS POLICY
> > > +    PARALLEL PARSER PARTIAL PARTITION PARTITIONS PASSING PASSWORD PLACING PLANS POLICY
> > >      POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY
> > >      PRIOR PRIVILEGES PROCEDURAL PROCEDURE PROGRAM PUBLICATION
> > >
> > > @@ -2587,6 +2589,16 @@ ForValues:
> > >
> > >                      $$ = (Node *) n;
> > >                  }
> > > +
> > > +            /* a HASH partition */
> > > +            | /*EMPTY*/
> > > +                {
> > > +                    PartitionBoundSpec *n = makeNode(PartitionBoundSpec);
> > > +
> > > +                    n->strategy = PARTITION_STRATEGY_HASH;
> > > +
> > > +                    $$ = (Node *) n;
> > > +                }
> > >          ;
> > >
> > >  partbound_datum:
> > > @@ -3666,7 +3678,7 @@ OptPartitionSpec: PartitionSpec    { $$ = $1; }
> > >              | /*EMPTY*/            { $$ = NULL; }
> > >          ;
> > >
> > > -PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> > > +PartitionSpec: PARTITION BY part_strategy '(' part_params ')' hash_partitions hash_function
> > >                  {
> > >                      PartitionSpec *n = makeNode(PartitionSpec);
> > >
> > > @@ -3674,10 +3686,21 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> > >                      n->partParams = $5;
> > >                      n->location = @1;
> > >
> > > +                    n->partnparts = $7;
> > > +                    n->hashfunc = $8;
> > > +
> > >                      $$ = n;
> > >                  }
> > >          ;
> > >
> > > +hash_partitions: PARTITIONS Iconst { $$ = $2; }
> > > +                    | /*EMPTY*/   { $$ = -1; }
> > > +        ;
> > > +
> > > +hash_function: USING handler_name { $$ = $2; }
> > > +                    | /*EMPTY*/ { $$ = NULL; }
> > > +        ;
> > > +
> > >  part_strategy:    IDENT                    { $$ = $1; }
> > >                  | unreserved_keyword    { $$ = pstrdup($1); }
> > >          ;
> > > @@ -14377,6 +14400,7 @@ unreserved_keyword:
> > >              | GLOBAL
> > >              | GRANTED
> > >              | HANDLER
> > > +            | HASH
> > >              | HEADER_P
> > >              | HOLD
> > >              | HOUR_P
> > > @@ -14448,6 +14472,7 @@ unreserved_keyword:
> > >              | PARSER
> > >              | PARTIAL
> > >              | PARTITION
> > > +            | PARTITIONS
> > >              | PASSING
> > >              | PASSWORD
> > >              | PLANS
> > > diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
> > > index ff2bab6..8e1be31 100644
> > > --- a/src/backend/parser/parse_utilcmd.c
> > > +++ b/src/backend/parser/parse_utilcmd.c
> > > @@ -40,6 +40,7 @@
> > >  #include "catalog/pg_opclass.h"
> > >  #include "catalog/pg_operator.h"
> > >  #include "catalog/pg_type.h"
> > > +#include "catalog/partition.h"
> > >  #include "commands/comment.h"
> > >  #include "commands/defrem.h"
> > >  #include "commands/tablecmds.h"
> > > @@ -3252,6 +3253,24 @@ transformPartitionBound(ParseState *pstate, Relation parent, Node *bound)
> > >              ++i;
> > >          }
> > >      }
> > > +    else if (strategy == PARTITION_STRATEGY_HASH)
> > > +    {
> > > +        Value     *conval;
> > > +        Node        *value;
> > > +        int          index;
> > > +
> > > +        if (spec->strategy != PARTITION_STRATEGY_HASH)
> > > +            ereport(ERROR,
> > > +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> > > +                 errmsg("invalid bound specification for a hash partition")));
> > > +
> > > +        index = get_next_hash_partition_index(parent);
> > > +
> > > +        /* store the partition index as a listdatums value */
> > > +        conval = makeInteger(index);
> > > +        value = (Node *) make_const(pstate, conval, -1);
> > > +        result_spec->listdatums = list_make1(value);
> > > +    }
> > >      else
> > >          elog(ERROR, "unexpected partition strategy: %d", (int) strategy);
> > >
> > > diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
> > > index b27b77d..fab6eea 100644
> > > --- a/src/backend/utils/adt/ruleutils.c
> > > +++ b/src/backend/utils/adt/ruleutils.c
> > > @@ -1423,7 +1423,7 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
> > >   *
> > >   * Returns the partition key specification, ie, the following:
> > >   *
> > > - * PARTITION BY { RANGE | LIST } (column opt_collation opt_opclass [, ...])
> > > + * PARTITION BY { RANGE | LIST | HASH } (column opt_collation opt_opclass [, ...])
> > >   */
> > >  Datum
> > >  pg_get_partkeydef(PG_FUNCTION_ARGS)
> > > @@ -1513,6 +1513,9 @@ pg_get_partkeydef_worker(Oid relid, int prettyFlags)
> > >          case PARTITION_STRATEGY_RANGE:
> > >              appendStringInfo(&buf, "RANGE");
> > >              break;
> > > +        case PARTITION_STRATEGY_HASH:
> > > +            appendStringInfo(&buf, "HASH");
> > > +            break;
> > >          default:
> > >              elog(ERROR, "unexpected partition strategy: %d",
> > >                   (int) form->partstrat);
> > > @@ -8520,6 +8523,9 @@ get_rule_expr(Node *node, deparse_context *context,
> > >                          appendStringInfoString(buf, ")");
> > >                          break;
> > >
> > > +                    case PARTITION_STRATEGY_HASH:
> > > +                        break;
> > > +
> > >                      default:
> > >                          elog(ERROR, "unrecognized partition strategy: %d",
> > >                               (int) spec->strategy);
> > > diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
> > > index 9001e20..829e4d2 100644
> > > --- a/src/backend/utils/cache/relcache.c
> > > +++ b/src/backend/utils/cache/relcache.c
> > > @@ -855,6 +855,9 @@ RelationBuildPartitionKey(Relation relation)
> > >      key->strategy = form->partstrat;
> > >      key->partnatts = form->partnatts;
> > >
> > > +    key->partnparts = form->partnparts;
> > > +    key->parthashfunc = form->parthashfunc;
> > > +
> > >      /*
> > >       * We can rely on the first variable-length attribute being mapped to the
> > >       * relevant field of the catalog's C struct, because all previous
> > > @@ -999,6 +1002,9 @@ copy_partition_key(PartitionKey fromkey)
> > >      newkey->strategy = fromkey->strategy;
> > >      newkey->partnatts = n = fromkey->partnatts;
> > >
> > > +    newkey->partnparts = fromkey->partnparts;
> > > +    newkey->parthashfunc = fromkey->parthashfunc;
> > > +
> > >      newkey->partattrs = (AttrNumber *) palloc(n * sizeof(AttrNumber));
> > >      memcpy(newkey->partattrs, fromkey->partattrs, n * sizeof(AttrNumber));
> > >
> > > diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
> > > index 1187797..367e2f8 100644
> > > --- a/src/include/catalog/heap.h
> > > +++ b/src/include/catalog/heap.h
> > > @@ -141,7 +141,7 @@ extern void StorePartitionKey(Relation rel,
> > >                    AttrNumber *partattrs,
> > >                    List *partexprs,
> > >                    Oid *partopclass,
> > > -                  Oid *partcollation);
> > > +                  Oid *partcollation, int16 partnparts, Oid hashfunc);
> > >  extern void RemovePartitionKeyByRelId(Oid relid);
> > >  extern void StorePartitionBound(Relation rel, Relation parent, Node *bound);
> > >
> > > diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
> > > index b195d1a..80f4b0e 100644
> > > --- a/src/include/catalog/partition.h
> > > +++ b/src/include/catalog/partition.h
> > > @@ -89,4 +89,6 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
> > >                          TupleTableSlot *slot,
> > >                          EState *estate,
> > >                          Oid *failed_at);
> > > +extern Expr *convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc);
> > > +extern int get_next_hash_partition_index(Relation parent);
> > >  #endif   /* PARTITION_H */
> > > diff --git a/src/include/catalog/pg_partitioned_table.h b/src/include/catalog/pg_partitioned_table.h
> > > index bdff36a..69e509c 100644
> > > --- a/src/include/catalog/pg_partitioned_table.h
> > > +++ b/src/include/catalog/pg_partitioned_table.h
> > > @@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> > >      char        partstrat;        /* partitioning strategy */
> > >      int16        partnatts;        /* number of partition key columns */
> > >
> > > +    int16        partnparts;
> > > +    Oid            parthashfunc;
> > > +
> > >      /*
> > >       * variable-length fields start here, but we allow direct access to
> > >       * partattrs via the C struct.  That's because the first variable-length
> > > @@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> > >      pg_node_tree partexprs;        /* list of expressions in the partition key;
> > >                                   * one item for each zero entry in partattrs[] */
> > >  #endif
> > > +
> > > +
> > >  } FormData_pg_partitioned_table;
> > >
> > >  /* ----------------
> > > @@ -62,13 +67,15 @@ typedef FormData_pg_partitioned_table *Form_pg_partitioned_table;
> > >   *        compiler constants for pg_partitioned_table
> > >   * ----------------
> > >   */
> > > -#define Natts_pg_partitioned_table                7
> > > +#define Natts_pg_partitioned_table                9
> > >  #define Anum_pg_partitioned_table_partrelid        1
> > >  #define Anum_pg_partitioned_table_partstrat        2
> > >  #define Anum_pg_partitioned_table_partnatts        3
> > > -#define Anum_pg_partitioned_table_partattrs        4
> > > -#define Anum_pg_partitioned_table_partclass        5
> > > -#define Anum_pg_partitioned_table_partcollation 6
> > > -#define Anum_pg_partitioned_table_partexprs        7
> > > +#define Anum_pg_partitioned_table_partnparts    4
> > > +#define Anum_pg_partitioned_table_parthashfunc    5
> > > +#define Anum_pg_partitioned_table_partattrs        6
> > > +#define Anum_pg_partitioned_table_partclass        7
> > > +#define Anum_pg_partitioned_table_partcollation 8
> > > +#define Anum_pg_partitioned_table_partexprs        9
> > >
> > >  #endif   /* PG_PARTITIONED_TABLE_H */
> > > diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
> > > index 5afc3eb..1c3474f 100644
> > > --- a/src/include/nodes/parsenodes.h
> > > +++ b/src/include/nodes/parsenodes.h
> > > @@ -730,11 +730,14 @@ typedef struct PartitionSpec
> > >      NodeTag        type;
> > >      char       *strategy;        /* partitioning strategy ('list' or 'range') */
> > >      List       *partParams;        /* List of PartitionElems */
> > > +    int            partnparts;
> > > +    List       *hashfunc;
> > >      int            location;        /* token location, or -1 if unknown */
> > >  } PartitionSpec;
> > >
> > >  #define PARTITION_STRATEGY_LIST        'l'
> > >  #define PARTITION_STRATEGY_RANGE    'r'
> > > +#define PARTITION_STRATEGY_HASH        'h'
> > >
> > >  /*
> > >   * PartitionBoundSpec - a partition bound specification
> > > diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
> > > index 985d650..0597939 100644
> > > --- a/src/include/parser/kwlist.h
> > > +++ b/src/include/parser/kwlist.h
> > > @@ -180,6 +180,7 @@ PG_KEYWORD("greatest", GREATEST, COL_NAME_KEYWORD)
> > >  PG_KEYWORD("group", GROUP_P, RESERVED_KEYWORD)
> > >  PG_KEYWORD("grouping", GROUPING, COL_NAME_KEYWORD)
> > >  PG_KEYWORD("handler", HANDLER, UNRESERVED_KEYWORD)
> > > +PG_KEYWORD("hash", HASH, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("having", HAVING, RESERVED_KEYWORD)
> > >  PG_KEYWORD("header", HEADER_P, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
> > > @@ -291,6 +292,7 @@ PG_KEYWORD("parallel", PARALLEL, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("parser", PARSER, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("partial", PARTIAL, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD)
> > > +PG_KEYWORD("partitions", PARTITIONS, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("passing", PASSING, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("password", PASSWORD, UNRESERVED_KEYWORD)
> > >  PG_KEYWORD("placing", PLACING, RESERVED_KEYWORD)
> > > diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> > > index a617a7c..660adfb 100644
> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -62,6 +62,9 @@ typedef struct PartitionKeyData
> > >      Oid           *partopcintype;    /* OIDs of opclass declared input data types */
> > >      FmgrInfo   *partsupfunc;    /* lookup info for support funcs */
> > >
> > > +    int16        partnparts;        /* number of hash partitions */
> > > +    Oid            parthashfunc;    /* OID of hash function */
> > > +
> > >      /* Partitioning collation per attribute */
> > >      Oid           *partcollation;
> > >
> >
> > >
> >
> >
> > --
> > Best regards,
> > Aleksander Alekseev
>
>
> --
> Yugo Nagata <nagata@sraoss.co.jp>

--
Best regards,
Aleksander Alekseev

Re: [HACKERS] [POC] hash partitioning

From

Maksim Milyutin

Date:

01 March 2017, 17:10:34

On 01.03.2017 05:14, Amit Langote wrote:
> Nagata-san,
>
>> A partition table can be created as below:
>>
>>  CREATE TABLE h1 PARTITION OF h;
>>  CREATE TABLE h2 PARTITION OF h;
>>  CREATE TABLE h3 PARTITION OF h;
>>
>> The FOR VALUES clause cannot be used, and the partition bound is
>> calculated automatically as a partition index, a single integer value.
>>
>> Trying to create more partitions than the number specified
>> by PARTITIONS raises an error.
>>
>> postgres=# create table h4 partition of h;
>> ERROR:  cannot create hash partition more than 3 for h
>
> Instead of having to create each partition individually, wouldn't it be
> better if the following command
>
> CREATE TABLE h (i int) PARTITION BY HASH (i) PARTITIONS 3;
>
> created the partitions *automatically*?

It's a good idea, but in that case we can't create a hash partition that is
itself a partitioned table, and as a consequence we are unable to create
subpartitions. My understanding is that a table can be partitioned
only with the CREATE TABLE statement, not with ALTER TABLE. For this reason
the newly created partitions could only be regular tables.

We can achieve the desired result by creating a separate partitioned
table and doing the DETACH/ATTACH manipulation, though. But IMO that is
not a flexible approach.

It would be a good thing if a regular table could be partitioned through
a separate command. Then your idea would not be restrictive.

-- 
Maksim Milyutin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [HACKERS] [POC] hash partitioning

From

Aleksander Alekseev

Date:

01 March 2017, 17:23:46

> We can achieve desired result through creating a separate partitioned table
> and making the DETACH/ATTACH manipulation, though. But IMO it's not flexible
> case.

I think it would be great to let the end user decide. If the user is
not interested in subpartitions, he or she can use syntax like 'CREATE
TABLE ... PARTITION BY HASH(i) PARTITIONS 3 CREATE AUTOMATICALLY;' or
maybe a built-in procedure for this. Otherwise the ATTACH/DETACH
syntax is also available.

Anyway, all of this could be discussed indefinitely and does not
necessarily have to be included in this particular patch. We could
probably agree that 3 or 4 separately discussed, reviewed and tested
patches are better than one huge patch that gets moved to the next
commitfest because of disagreements about syntax.

--
Best regards,
Aleksander Alekseev

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

02 March 2017, 11:45:38

Hi Aleksander,

Thank you for reviewing the patch.

On Wed, 1 Mar 2017 17:08:49 +0300
Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:

> Hi, Yugo.
> 
> Today I've had an opportunity to take a closer look at this patch. Here are
> a few things that bother me.
> 
> 1a) There are missing comments here:
> 
> ```
> --- a/src/include/catalog/pg_partitioned_table.h
> +++ b/src/include/catalog/pg_partitioned_table.h
> @@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
>     char        partstrat;      /* partitioning strategy */
>     int16       partnatts;      /* number of partition key columns */
> 
> +   int16       partnparts;
> +   Oid         parthashfunc;
> +
> ```
> 
> 1b) ... and here:
> 
> ```
> --- a/src/include/nodes/parsenodes.h
> +++ b/src/include/nodes/parsenodes.h
> @@ -730,11 +730,14 @@ typedef struct PartitionSpec
>     NodeTag     type;
>     char       *strategy;       /* partitioning strategy ('list' or 'range') */
>     List       *partParams;     /* List of PartitionElems */
> +   int         partnparts;
> +   List       *hashfunc;
>     int         location;       /* token location, or -1 if unknown */
>  } PartitionSpec;
> ```

OK, I'll add comments for these members.
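
As a rough sketch only, the commented members might look like the following. The wording for `partnparts` and `parthashfunc` is borrowed from the comments the patch already carries in rel.h, and the `-1`/`NULL` defaults come from the grammar rules in gram.y. The struct names and the stand-in typedefs are assumptions made so the fragment compiles outside the PostgreSQL tree; they are not the real catalog or parse-node definitions.

```c
#include <stdint.h>

/* Stand-in typedefs (assumptions) so this compiles on its own. */
typedef uint32_t Oid;
typedef int16_t int16;
typedef struct List List;       /* opaque placeholder for PostgreSQL's List */

/* Hypothetical excerpt of the fixed-width part of pg_partitioned_table. */
typedef struct PartitionedTableExcerpt
{
    char        partstrat;      /* partitioning strategy */
    int16       partnatts;      /* number of partition key columns */
    int16       partnparts;     /* number of hash partitions */
    Oid         parthashfunc;   /* OID of hash function */
} PartitionedTableExcerpt;

/* Hypothetical excerpt of PartitionSpec with the two new members. */
typedef struct PartitionSpecExcerpt
{
    char       *strategy;       /* partitioning strategy ('list', 'range' or 'hash') */
    List       *partParams;     /* List of PartitionElems */
    int         partnparts;     /* number of hash partitions, -1 if PARTITIONS omitted */
    List       *hashfunc;       /* name of hash function from USING, NULL if omitted */
    int         location;       /* token location, or -1 if unknown */
} PartitionSpecExcerpt;
```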

> 
> 2) I believe new empty lines in patches are generally not welcomed by
> the community:
> 
> ```
> @@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
>     pg_node_tree partexprs;     /* list of expressions in the partition key;
>                                  * one item for each zero entry in partattrs[] */
>  #endif
> +
> +
>  } FormData_pg_partitioned_table;
> ```

I'll remove it from the patch.

> 
> 3) One test fails on my laptop (Arch Linux, x64) [1]:
> 
> ```
> ***************
> *** 344,350 ****
>   CREATE TABLE partitioned (
>       a int
>   ) PARTITION BY HASH (a);
> ! ERROR:  unrecognized partitioning strategy "hash"
>   -- specified column must be present in the table
>   CREATE TABLE partitioned (
>       a int
> --- 344,350 ----
>   CREATE TABLE partitioned (
>       a int
>   ) PARTITION BY HASH (a);
> ! ERROR:  number of partitions must be specified for hash partition
>   -- specified column must be present in the table
>   CREATE TABLE partitioned (
>       a int
> ```

This is the expected behavior in the current patch. However, the
specification of CREATE TABLE is still under discussion, so
it may change.
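
To show why that message appears for PARTITION BY HASH (a) without a PARTITIONS clause, here is a minimal stand-alone sketch of the checks the patch adds to transformPartitionSpec(). The function name `check_partition_spec` and its flattened arguments are assumptions for illustration; the strategy codes, the `-1` default for an omitted PARTITIONS clause, and the message texts are taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Strategy codes as defined by the patch in parsenodes.h. */
#define PARTITION_STRATEGY_LIST     'l'
#define PARTITION_STRATEGY_RANGE    'r'
#define PARTITION_STRATEGY_HASH     'h'

/*
 * Hypothetical mirror of the new checks in transformPartitionSpec().
 * nparts is -1 when PARTITIONS was omitted (the grammar's default), and
 * has_hashfunc tells whether a USING clause was given.
 * Returns the error text the patch would report, or NULL if the spec passes.
 */
const char *
check_partition_spec(char strategy, int nparts, bool has_hashfunc)
{
    if (strategy == PARTITION_STRATEGY_HASH)
    {
        if (nparts < 0)
            return "number of partitions must be specified for hash partition";
        if (nparts == 0)
            return "number of partitions must be greater than 0";
        if (!has_hashfunc)
            return "hash function must be specified for hash partition";
    }
    else
    {
        if (nparts >= 0)
            return "number of partitions can be specified only for hash partition";
        if (has_hashfunc)
            return "hash function can be specified only for hash partition";
    }
    return NULL;
}
```

With these checks, `check_partition_spec('h', -1, false)` yields the "number of partitions must be specified" text seen in the regression diff, because the missing PARTITIONS clause is tested before the missing hash function.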

> 
> The exact script I'm using for building and testing PostgreSQL can be
> found here [2].
> 
> 4) As I already mentioned - missing documentation.

I think writing the documentation should wait until the specification
reaches a consensus.

> 
> In general the patch looks quite good to me. I personally believe it has
> every chance of being accepted in the current commitfest. Naturally, if the
> community comes to a consensus regarding keywords, whether all partitions
> should be created automatically, etc :)
> 
> [1] http://afiskon.ru/s/dd/20cbe21934_regression.diffs.txt
> [2] http://afiskon.ru/s/76/a4fb71739c_full-build.sh.txt
> 
> On Wed, Mar 01, 2017 at 06:10:10PM +0900, Yugo Nagata wrote:
> > Hi Aleksander,
> > 
> > On Tue, 28 Feb 2017 18:05:36 +0300
> > Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> > 
> > > Hi, Yugo.
> > > 
> > > > Looks like a great feature! I'm going to take a closer look at your code
> > > > and write up feedback shortly. For now I can only tell you that you forgot
> > > > to include documentation in the patch.
> > > 
> > > Thank you for looking into it. I'm looking forward to your feedback.
> > > This is a proof-of-concept patch, so additional documentation
> > > is not included. I'll add it after reaching a consensus
> > > on the specification of the feature.
> > 
> > > 
> > > > I've added a corresponding entry to the current commitfest [1]. Hope you
> > > > don't mind. If it's not too much trouble, could you please register on the
> > > > commitfest site and add yourself to this entry as an author? I'm pretty
> > > > sure someone uses this information for writing release notes or
> > > > something like that.
> > > 
> > > Thank you for registering it in the commitfest. I have added myself as an author.
> > 
> > > 
> > > [1] https://commitfest.postgresql.org/13/1059/
> > > 
> > > On Tue, Feb 28, 2017 at 11:33:13PM +0900, Yugo Nagata wrote:
> > > > Hi all,
> > > > 
> > > > > Now we have declarative partitioning, but hash partitioning is not
> > > > > implemented yet. Attached is a POC patch to add the hash partitioning
> > > > > feature. I know we will need more discussion about the syntax and other
> > > > > specifications before going ahead with the project, but I think this
> > > > > runnable code might help us discuss what to implement and how.
> > > > 
> > > > * Description
> > > > 
> > > > > In this patch, the hash partitioning implementation is basically based
> > > > > on the list partitioning mechanism. However, partition bounds cannot be
> > > > > specified explicitly; the bound is used internally as the hash partition
> > > > > index, which is calculated when a partition is created or attached.
> > > > > 
> > > > > The tentative syntax to create a partitioned table is as below:
> > > > 
> > > >  CREATE TABLE h (i int) PARTITION BY HASH(i) PARTITIONS 3 USING hashint4;
> > > > 
> > > > > The number of partitions is specified by PARTITIONS. It is currently
> > > > > constant and cannot be changed, but I think this needs to be changed in
> > > > > some manner. A hash function is specified by USING. Perhaps the hash
> > > > > function could be omitted, in which case a default hash function
> > > > > corresponding to the key type would be used.
> > > > > 
> > > > > A partition table can be created as below:
> > > > 
> > > >  CREATE TABLE h1 PARTITION OF h;
> > > >  CREATE TABLE h2 PARTITION OF h;
> > > >  CREATE TABLE h3 PARTITION OF h;
> > > > 
> > > > > The FOR VALUES clause cannot be used, and the partition bound is
> > > > > calculated automatically as a partition index, a single integer value.
> > > > > 
> > > > > Trying to create more partitions than the number specified
> > > > > by PARTITIONS raises an error.
> > > > 
> > > > postgres=# create table h4 partition of h;
> > > > ERROR:  cannot create hash partition more than 3 for h
> > > > 
> > > > An inserted record is stored in a partition whose index equals
> > > > abs(hashfunc(key)) % <number_of_partitions>. In the above
> > > > example, this is abs(hashint4(i))%3.
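To make the routing rule above concrete, here is a minimal standalone sketch of abs(hashfunc(key)) % <number_of_partitions>. It is not code from the patch: demo_hash is a hypothetical stand-in for a user-supplied function such as hashint4 (it does not reproduce PostgreSQL's hash values), and partition_index and partition_for_key are names invented for illustration. The absolute value is taken in 64 bits so that INT32_MIN does not overflow, which is an assumption about how the edge case might be handled rather than something the patch does.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for a hash function such as hashint4.
 * This is NOT PostgreSQL's implementation; it only scrambles the bits.
 * The final conversion back to int32_t wraps on common compilers.
 */
int32_t demo_hash(int32_t key)
{
    uint32_t h = (uint32_t) key;

    h ^= h >> 16;
    h *= 0x45d9f3bU;
    h ^= h >> 16;
    return (int32_t) h;
}

/*
 * abs(hash) % nparts, with the absolute value taken in 64 bits so that
 * INT32_MIN cannot overflow. nparts must be greater than 0.
 */
int partition_index(int32_t hash, int nparts)
{
    int64_t h = hash;

    if (h < 0)
        h = -h;
    return (int) (h % nparts);
}

/* Full rule: hash the key, then map the hash value to a partition index. */
int partition_for_key(int32_t key, int nparts)
{
    return partition_index(demo_hash(key), nparts);
}
```

With three partitions, partition_for_key(key, 3) always yields 0, 1, or 2, which would correspond to h1, h2, and h3 if indexes are handed out in creation order.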
> > > > 
> > > > postgres=# insert into h (select generate_series(0,20));
> > > > INSERT 0 21
> > > > 
> > > > postgres=# select *,tableoid::regclass from h;
> > > >  i  | tableoid 
> > > > ----+----------
> > > >   0 | h1
> > > >   1 | h1
> > > >   2 | h1
> > > >   4 | h1
> > > >   8 | h1
> > > >  10 | h1
> > > >  11 | h1
> > > >  14 | h1
> > > >  15 | h1
> > > >  17 | h1
> > > >  20 | h1
> > > >   5 | h2
> > > >  12 | h2
> > > >  13 | h2
> > > >  16 | h2
> > > >  19 | h2
> > > >   3 | h3
> > > >   6 | h3
> > > >   7 | h3
> > > >   9 | h3
> > > >  18 | h3
> > > > (21 rows)
> > > > 
> > > > * Todo / discussions
> > > > 
> > > > In this patch, we cannot change the number of partitions specified
> > > > by PARTITIONS. If we could change this, the partitioning rule
> > > > (<partition index> = abs(hashfunc(key)) % <number_of_partitions>)
> > > > would also change, and we would then need to reallocate records between
> > > > partitions.
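A rough illustration of why changing the partition count forces records to be reallocated: the sketch below counts how many keys in a range would land in a different partition when the modulus changes. It uses a trivial identity hash purely so the numbers can be checked by hand; identity_hash, partition_index, and count_moved_keys are hypothetical names, not part of the patch, and a real hash function would give a different count.

```c
#include <stdint.h>

/* Hypothetical identity "hash", chosen only so results are easy to verify. */
int32_t identity_hash(int32_t key)
{
    return key;
}

/* abs(hash) % nparts, computed in 64 bits; nparts must be greater than 0. */
int partition_index(int32_t hash, int nparts)
{
    int64_t h = hash;

    if (h < 0)
        h = -h;
    return (int) (h % nparts);
}

/*
 * Count the keys in [first, last] whose partition index differs between
 * old_nparts and new_nparts, i.e. the rows that would have to be moved.
 */
int count_moved_keys(int32_t first, int32_t last,
                     int old_nparts, int new_nparts)
{
    int     moved = 0;
    int64_t k;          /* 64-bit counter so last == INT32_MAX is safe */

    for (k = first; k <= last; k++)
    {
        int32_t h = identity_hash((int32_t) k);

        if (partition_index(h, old_nparts) != partition_index(h, new_nparts))
            moved++;
    }
    return moved;
}
```

With this identity hash, going from 3 to 4 partitions moves 15 of the 21 keys 0..20; only 0, 1, 2, 12, 13, and 14 keep their index.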
> > > > 
> > > > In this patch, the user can specify a hash function with USING. However,
> > > > we might need default hash functions that are useful and
> > > > proper for hash partitioning.
> > > > 
> > > > Currently, even when we issue a SELECT query with a condition,
> > > > postgres looks into all partitions regardless of each partition's
> > > > constraint, because the constraint is a complicated expression such as
> > > > "abs(hashint4(i))%3 = 0".
> > > > 
> > > > postgres=# explain select * from h where i = 10;
> > > >                         QUERY PLAN                        
> > > > ----------------------------------------------------------
> > > >  Append  (cost=0.00..125.62 rows=40 width=4)
> > > >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> > > >          Filter: (i = 10)
> > > >    ->  Seq Scan on h1  (cost=0.00..41.88 rows=13 width=4)
> > > >          Filter: (i = 10)
> > > >    ->  Seq Scan on h2  (cost=0.00..41.88 rows=13 width=4)
> > > >          Filter: (i = 10)
> > > >    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4)
> > > >          Filter: (i = 10)
> > > > (9 rows)
> > > > 
> > > > However, if we rewrite the condition into the same expression
> > > > as the partition's constraint, postgres can exclude unrelated
> > > > tables from the search targets. So, we might avoid the problem
> > > > by converting the qual properly before calling predicate_refuted_by().
> > > > 
> > > > postgres=# explain select * from h where abs(hashint4(i))%3 = abs(hashint4(10))%3;
> > > >                         QUERY PLAN                        
> > > > ----------------------------------------------------------
> > > >  Append  (cost=0.00..61.00 rows=14 width=4)
> > > >    ->  Seq Scan on h  (cost=0.00..0.00 rows=1 width=4)
> > > >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > > >    ->  Seq Scan on h3  (cost=0.00..61.00 rows=13 width=4)
> > > >          Filter: ((abs(hashint4(i)) % 3) = 2)
> > > > (5 rows)
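The pruning idea in the second plan can be sketched outside the planner as follows: for an equality condition on the key, hash the constant, compute its partition index, and scan only that partition. This is a simplified illustration under stated assumptions, not how predicate_refuted_by() works internally; prune_for_equality and partition_index are hypothetical names, and the caller is assumed to pass in the already-hashed constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* abs(hash) % nparts, computed in 64 bits; nparts must be greater than 0. */
int partition_index(int32_t hash, int nparts)
{
    int64_t h = hash;

    if (h < 0)
        h = -h;
    return (int) (h % nparts);
}

/*
 * For a qual of the form "key = const", mark which partitions must be
 * scanned. const_hash is the hash of the constant; scan must point to an
 * array of nparts elements. Only the partition whose index matches is kept.
 */
void prune_for_equality(int32_t const_hash, int nparts, bool *scan)
{
    int target = partition_index(const_hash, nparts);
    int i;

    for (i = 0; i < nparts; i++)
        scan[i] = (i == target);
}
```

For example, a constant whose hash value is 5 leaves only index 2 marked for scanning when there are three partitions, matching the shape of the plan above where a single child table remains.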
> > > > 
> > > > Best regards,
> > > > Yugo Nagata
> > > > 
> > > > -- 
> > > > Yugo Nagata <nagata@sraoss.co.jp>
> > > 
> > > > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
> > > > index 41c0056..3820920 100644
> > > > --- a/src/backend/catalog/heap.c
> > > > +++ b/src/backend/catalog/heap.c
> > > > @@ -3074,7 +3074,7 @@ StorePartitionKey(Relation rel,
> > > >                    AttrNumber *partattrs,
> > > >                    List *partexprs,
> > > >                    Oid *partopclass,
> > > > -                  Oid *partcollation)
> > > > +                  Oid *partcollation, int16 partnparts, Oid hashfunc)
> > > >  {
> > > >      int            i;
> > > >      int2vector *partattrs_vec;
> > > > @@ -3121,6 +3121,8 @@ StorePartitionKey(Relation rel,
> > > >      values[Anum_pg_partitioned_table_partrelid - 1] = ObjectIdGetDatum(RelationGetRelid(rel));
> > > >      values[Anum_pg_partitioned_table_partstrat - 1] = CharGetDatum(strategy);
> > > >      values[Anum_pg_partitioned_table_partnatts - 1] = Int16GetDatum(partnatts);
> > > > +    values[Anum_pg_partitioned_table_partnparts - 1] = Int16GetDatum(partnparts);
> > > > +    values[Anum_pg_partitioned_table_parthashfunc - 1] = ObjectIdGetDatum(hashfunc);
> > > >      values[Anum_pg_partitioned_table_partattrs - 1] = PointerGetDatum(partattrs_vec);
> > > >      values[Anum_pg_partitioned_table_partclass - 1] = PointerGetDatum(partopclass_vec);
> > > >      values[Anum_pg_partitioned_table_partcollation - 1] = PointerGetDatum(partcollation_vec);
> > > > diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
> > > > index 4bcef58..24e69c6 100644
> > > > --- a/src/backend/catalog/partition.c
> > > > +++ b/src/backend/catalog/partition.c
> > > > @@ -36,6 +36,8 @@
> > > >  #include "optimizer/clauses.h"
> > > >  #include "optimizer/planmain.h"
> > > >  #include "optimizer/var.h"
> > > > +#include "parser/parse_func.h"
> > > > +#include "parser/parse_oper.h"
> > > >  #include "rewrite/rewriteManip.h"
> > > >  #include "storage/lmgr.h"
> > > >  #include "utils/array.h"
> > > > @@ -120,6 +122,7 @@ static int32 qsort_partition_rbound_cmp(const void *a, const void *b,
> > > >  
> > > >  static List *get_qual_for_list(PartitionKey key, PartitionBoundSpec *spec);
> > > >  static List *get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec);
> > > > +static List *get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec);
> > > >  static Oid get_partition_operator(PartitionKey key, int col,
> > > >                         StrategyNumber strategy, bool *need_relabel);
> > > >  static List *generate_partition_qual(Relation rel);
> > > > @@ -236,7 +239,8 @@ RelationBuildPartitionDesc(Relation rel)
> > > >              oids[i++] = lfirst_oid(cell);
> > > >  
> > > >          /* Convert from node to the internal representation */
> > > > -        if (key->strategy == PARTITION_STRATEGY_LIST)
> > > > +        if (key->strategy == PARTITION_STRATEGY_LIST ||
> > > > +            key->strategy == PARTITION_STRATEGY_HASH)
> > > >          {
> > > >              List       *non_null_values = NIL;
> > > >  
> > > > @@ -251,7 +255,7 @@ RelationBuildPartitionDesc(Relation rel)
> > > >                  ListCell   *c;
> > > >                  PartitionBoundSpec *spec = lfirst(cell);
> > > >  
> > > > -                if (spec->strategy != PARTITION_STRATEGY_LIST)
> > > > +                if (spec->strategy != key->strategy)
> > > >                      elog(ERROR, "invalid strategy in partition bound spec");
> > > >  
> > > >                  foreach(c, spec->listdatums)
> > > > @@ -464,6 +468,7 @@ RelationBuildPartitionDesc(Relation rel)
> > > >          switch (key->strategy)
> > > >          {
> > > >              case PARTITION_STRATEGY_LIST:
> > > > +            case PARTITION_STRATEGY_HASH:
> > > >                  {
> > > >                      boundinfo->has_null = found_null;
> > > >                      boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
> > > > @@ -829,6 +834,18 @@ check_new_partition_bound(char *relname, Relation parent, Node *bound)
> > > >                  break;
> > > >              }
> > > >  
> > > > +        case PARTITION_STRATEGY_HASH:
> > > > +            {
> > > > +                Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> > > > +
> > > > +                if (partdesc->nparts + 1 > key->partnparts)
> > > > +                    ereport(ERROR,
> > > > +                            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > > > +                    errmsg("cannot create hash partition more than %d for %s",
> > > > +                            key->partnparts, RelationGetRelationName(parent))));
> > > > +                break;
> > > > +            }
> > > > +
> > > >          default:
> > > >              elog(ERROR, "unexpected partition strategy: %d",
> > > >                   (int) key->strategy);
> > > > @@ -916,6 +933,11 @@ get_qual_from_partbound(Relation rel, Relation parent, Node *bound)
> > > >              my_qual = get_qual_for_range(key, spec);
> > > >              break;
> > > >  
> > > > +        case PARTITION_STRATEGY_HASH:
> > > > +            Assert(spec->strategy == PARTITION_STRATEGY_HASH);
> > > > +            my_qual = get_qual_for_hash(key, spec);
> > > > +            break;
> > > > +
> > > >          default:
> > > >              elog(ERROR, "unexpected partition strategy: %d",
> > > >                   (int) key->strategy);
> > > > @@ -1146,6 +1168,84 @@ RelationGetPartitionDispatchInfo(Relation rel, int lockmode,
> > > >      return pd;
> > > >  }
> > > >  
> > > > +/*
> > > > + * convert_expr_for_hash
> > > > + *
> > > > + * Converts an expr for a hash partition's constraint.
> > > > + * expr is converted into 'abs(hashfunc(expr)) % npart'.
> > > > + *
> > > > + * npart: number of partitions
> > > > + * hashfunc: OID of hash function
> > > > + */
> > > > +Expr *
> > > > +convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc)
> > > > +{
> > > > +    FuncExpr   *func,
> > > > +               *abs;
> > > > +    Expr        *modexpr;
> > > > +    Oid            modoid;
> > > > +    Oid            int4oid[1] = {INT4OID};
> > > > +
> > > > +    ParseState *pstate = make_parsestate(NULL);
> > > > +    Value       *val_npart = makeInteger(npart);
> > > > +    Node       *const_npart = (Node *) make_const(pstate, val_npart, -1);
> > > > +
> > > > +    /* hash function */
> > > > +    func = makeFuncExpr(hashfunc,
> > > > +                        INT4OID,
> > > > +                        list_make1(expr),
> > > > +                        0,
> > > > +                        0,
> > > > +                        COERCE_EXPLICIT_CALL);
> > > > +
> > > > +    /* Abs */
> > > > +    abs = makeFuncExpr(LookupFuncName(list_make1(makeString("abs")), 1, int4oid, false),
> > > > +                       INT4OID,
> > > > +                       list_make1(func),
> > > > +                       0,
> > > > +                       0,
> > > > +                       COERCE_EXPLICIT_CALL);
> > > > +
> > > > +    /* modulo by npart */
> > > > +    modoid = LookupOperName(pstate, list_make1(makeString("%")), INT4OID, INT4OID, false, -1);
> > > > +    modexpr = make_opclause(modoid, INT4OID, false, (Expr*)abs, (Expr*)const_npart, 0, 0);
> > > > +
> > > > +    return modexpr;
> > > > +}
> > > > +
> > > > +
> > > > +/*
> > > > + * get_next_hash_partition_index
> > > > + *
> > > > + * Returns the minimal index which is not used for hash partition.
> > > > + */
> > > > +int
> > > > +get_next_hash_partition_index(Relation parent)
> > > > +{
> > > > +    PartitionKey key = RelationGetPartitionKey(parent);
> > > > +    PartitionDesc partdesc = RelationGetPartitionDesc(parent);
> > > > +
> > > > +    int      i;
> > > > +    bool *used = palloc0(sizeof(bool) * key->partnparts);
> > > > +
> > > > +    /* mark used for existing partition indexes */
> > > > +    for (i = 0; i < partdesc->boundinfo->ndatums; i++)
> > > > +    {
> > > > +        Datum* datum = partdesc->boundinfo->datums[i];
> > > > +        int idx = DatumGetInt16(datum[0]);
> > > > +
> > > > +        if (!used[idx])
> > > > +            used[idx] = true;
> > > > +    }
> > > > +
> > > > +    /* find the minimal unused index */
> > > > +    for (i = 0; i < key->partnparts; i++)
> > > > +        if (!used[i])
> > > > +            break;
> > > > +
> > > > +    return i;
> > > > +}
> > > > +
> > > >  /* Module-local functions */
> > > >  
> > > >  /*
> > > > @@ -1467,6 +1567,43 @@ get_qual_for_range(PartitionKey key, PartitionBoundSpec *spec)
> > > >  }
> > > >  
> > > >  /*
> > > > + * get_qual_for_hash
> > > > + *
> > > > + * Returns a list of expressions to use as a hash partition's constraint.
> > > > + */
> > > > +static List *
> > > > +get_qual_for_hash(PartitionKey key, PartitionBoundSpec *spec)
> > > > +{
> > > > +    List       *result;
> > > > +    Expr       *keyCol;
> > > > +    Expr       *expr;
> > > > +    Expr        *opexpr;
> > > > +    Oid            operoid;
> > > > +    ParseState *pstate = make_parsestate(NULL);
> > > > +
> > > > +    /* Left operand */
> > > > +    if (key->partattrs[0] != 0)
> > > > +        keyCol = (Expr *) makeVar(1,
> > > > +                                  key->partattrs[0],
> > > > +                                  key->parttypid[0],
> > > > +                                  key->parttypmod[0],
> > > > +                                  key->parttypcoll[0],
> > > > +                                  0);
> > > > +    else
> > > > +        keyCol = (Expr *) copyObject(linitial(key->partexprs));
> > > > +
> > > > +    expr = convert_expr_for_hash(keyCol, key->partnparts, key->parthashfunc);
> > > > +
> > > > +    /* equals the listdatums value */
> > > > +    operoid = LookupOperName(pstate, list_make1(makeString("=")), INT4OID, INT4OID, false, -1);
> > > > +    opexpr = make_opclause(operoid, BOOLOID, false, expr, linitial(spec->listdatums), 0, 0);
> > > > +
> > > > +    result = list_make1(opexpr);
> > > > +
> > > > +    return result;
> > > > +}
> > > > +
> > > > +/*
> > > >   * get_partition_operator
> > > >   *
> > > >   * Return oid of the operator of given strategy for a given partition key
> > > > @@ -1730,6 +1867,11 @@ get_partition_for_tuple(PartitionDispatch *pd,
> > > >                              (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
> > > >                          errmsg("range partition key of row contains null")));
> > > >          }
> > > > +        else if (key->strategy == PARTITION_STRATEGY_HASH)
> > > > +        {
> > > > +            values[0] = OidFunctionCall1(key->parthashfunc, values[0]);
> > > > +            values[0] = Int16GetDatum(Abs(DatumGetInt16(values[0])) % key->partnparts);
> > > > +        }
> > > >  
> > > >          if (partdesc->boundinfo->has_null && isnull[0])
> > > >              /* Tuple maps to the null-accepting list partition */
> > > > @@ -1744,6 +1886,7 @@ get_partition_for_tuple(PartitionDispatch *pd,
> > > >              switch (key->strategy)
> > > >              {
> > > >                  case PARTITION_STRATEGY_LIST:
> > > > +                case PARTITION_STRATEGY_HASH:
> > > >                      if (cur_offset >= 0 && equal)
> > > >                          cur_index = partdesc->boundinfo->indexes[cur_offset];
> > > >                      else
> > > > @@ -1968,6 +2111,7 @@ partition_bound_cmp(PartitionKey key, PartitionBoundInfo boundinfo,
> > > >      switch (key->strategy)
> > > >      {
> > > >          case PARTITION_STRATEGY_LIST:
> > > > +        case PARTITION_STRATEGY_HASH:
> > > >              cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
> > > >                                                       key->partcollation[0],
> > > >                                                       bound_datums[0],
> > > > diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
> > > > index 3cea220..5a28cc0 100644
> > > > --- a/src/backend/commands/tablecmds.c
> > > > +++ b/src/backend/commands/tablecmds.c
> > > > @@ -41,6 +41,7 @@
> > > >  #include "catalog/pg_inherits_fn.h"
> > > >  #include "catalog/pg_namespace.h"
> > > >  #include "catalog/pg_opclass.h"
> > > > +#include "catalog/pg_proc.h"
> > > >  #include "catalog/pg_tablespace.h"
> > > >  #include "catalog/pg_trigger.h"
> > > >  #include "catalog/pg_type.h"
> > > > @@ -77,6 +78,7 @@
> > > >  #include "parser/parse_oper.h"
> > > >  #include "parser/parse_relation.h"
> > > >  #include "parser/parse_type.h"
> > > > +#include "parser/parse_func.h"
> > > >  #include "parser/parse_utilcmd.h"
> > > >  #include "parser/parser.h"
> > > >  #include "pgstat.h"
> > > > @@ -450,7 +452,7 @@ static void RangeVarCallbackForAlterRelation(const RangeVar *rv, Oid relid,
> > > >                                   Oid oldrelid, void *arg);
> > > >  static bool is_partition_attr(Relation rel, AttrNumber attnum, bool *used_in_expr);
> > > >  static PartitionSpec *transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy);
> > > > -static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > > > +static void ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> > > >                        List **partexprs, Oid *partopclass, Oid *partcollation);
> > > >  static void CreateInheritance(Relation child_rel, Relation parent_rel);
> > > >  static void RemoveInheritance(Relation child_rel, Relation parent_rel);
> > > > @@ -799,8 +801,10 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> > > >          AttrNumber    partattrs[PARTITION_MAX_KEYS];
> > > >          Oid            partopclass[PARTITION_MAX_KEYS];
> > > >          Oid            partcollation[PARTITION_MAX_KEYS];
> > > > +        Oid            partatttypes[PARTITION_MAX_KEYS];
> > > >          List       *partexprs = NIL;
> > > >          List       *cmds = NIL;
> > > > +        Oid hashfuncOid = InvalidOid;
> > > >  
> > > >          /*
> > > >           * We need to transform the raw parsetrees corresponding to partition
> > > > @@ -811,15 +815,40 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
> > > >          stmt->partspec = transformPartitionSpec(rel, stmt->partspec,
> > > >                                                  &strategy);
> > > >          ComputePartitionAttrs(rel, stmt->partspec->partParams,
> > > > -                              partattrs, &partexprs, partopclass,
> > > > +                              partattrs, partatttypes, &partexprs, partopclass,
> > > >                                partcollation);
> > > >  
> > > >          partnatts = list_length(stmt->partspec->partParams);
> > > > +
> > > > +        if (strategy == PARTITION_STRATEGY_HASH)
> > > > +        {
> > > > +            Oid funcrettype;
> > > > +
> > > > +            if (partnatts != 1)
> > > > +                ereport(ERROR,
> > > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                        errmsg("number of partition key must be 1 for hash partition")));
> > > > +
> > > > +            hashfuncOid = LookupFuncName(stmt->partspec->hashfunc, 1, partatttypes, false);
> > > > +            funcrettype = get_func_rettype(hashfuncOid);
> > > > +            if (funcrettype != INT4OID)
> > > > +                ereport(ERROR,
> > > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                        errmsg("hash function for partitioning must return integer")));
> > > > +
> > > > +            if (func_volatile(hashfuncOid) != PROVOLATILE_IMMUTABLE)
> > > > +                ereport(ERROR,
> > > > +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                        errmsg("hash function for partitioning must be marked IMMUTABLE")));
> > > > +
> > > > +        }
> > > > +
> > > >          StorePartitionKey(rel, strategy, partnatts, partattrs, partexprs,
> > > > -                          partopclass, partcollation);
> > > > +                          partopclass, partcollation, stmt->partspec->partnparts, hashfuncOid);
> > > >  
> > > > -        /* Force key columns to be NOT NULL when using range partitioning */
> > > > -        if (strategy == PARTITION_STRATEGY_RANGE)
> > > > +        /* Force key columns to be NOT NULL when using range or hash partitioning */
> > > > +        if (strategy == PARTITION_STRATEGY_RANGE ||
> > > > +            strategy == PARTITION_STRATEGY_HASH)
> > > >          {
> > > >              for (i = 0; i < partnatts; i++)
> > > >              {
> > > > @@ -12783,18 +12812,51 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> > > >      newspec->strategy = partspec->strategy;
> > > >      newspec->location = partspec->location;
> > > >      newspec->partParams = NIL;
> > > > +    newspec->partnparts = partspec->partnparts;
> > > > +    newspec->hashfunc = partspec->hashfunc;
> > > >  
> > > >      /* Parse partitioning strategy name */
> > > >      if (!pg_strcasecmp(partspec->strategy, "list"))
> > > >          *strategy = PARTITION_STRATEGY_LIST;
> > > >      else if (!pg_strcasecmp(partspec->strategy, "range"))
> > > >          *strategy = PARTITION_STRATEGY_RANGE;
> > > > +    else if (!pg_strcasecmp(partspec->strategy, "hash"))
> > > > +        *strategy = PARTITION_STRATEGY_HASH;
> > > >      else
> > > >          ereport(ERROR,
> > > >                  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > >                   errmsg("unrecognized partitioning strategy \"%s\"",
> > > >                          partspec->strategy)));
> > > >  
> > > > +    if (*strategy == PARTITION_STRATEGY_HASH)
> > > > +    {
> > > > +        if (partspec->partnparts < 0)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                     errmsg("number of partitions must be specified for hash partition")));
> > > > +        else if (partspec->partnparts == 0)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                     errmsg("number of partitions must be greater than 0")));
> > > > +
> > > > +        if (list_length(partspec->hashfunc) == 0)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                     errmsg("hash function must be specified for hash partition")));
> > > > +    }
> > > > +    else
> > > > +    {
> > > > +        if (partspec->partnparts >= 0)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                     errmsg("number of partitions can be specified only for hash partition")));
> > > > +
> > > > +        if (list_length(partspec->hashfunc) > 0)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > > +                     errmsg("hash function can be specified only for hash partition")));
> > > > +    }
> > > > +
> > > >      /*
> > > >       * Create a dummy ParseState and insert the target relation as its sole
> > > >       * rangetable entry.  We need a ParseState for transformExpr.
> > > > @@ -12843,7 +12905,7 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
> > > >   * Compute per-partition-column information from a list of PartitionElem's
> > > >   */
> > > >  static void
> > > > -ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > > > +ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs, Oid *partatttypes,
> > > >                        List **partexprs, Oid *partopclass, Oid *partcollation)
> > > >  {
> > > >      int            attn;
> > > > @@ -13010,6 +13072,7 @@ ComputePartitionAttrs(Relation rel, List *partParams, AttrNumber *partattrs,
> > > >                                                 "btree",
> > > >                                                 BTREE_AM_OID);
> > > >  
> > > > +        partatttypes[attn] = atttype;
> > > >          attn++;
> > > >      }
> > > >  }
> > > > diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> > > > index 05d8538..f4febc9 100644
> > > > --- a/src/backend/nodes/copyfuncs.c
> > > > +++ b/src/backend/nodes/copyfuncs.c
> > > > @@ -4232,6 +4232,8 @@ _copyPartitionSpec(const PartitionSpec *from)
> > > >  
> > > >      COPY_STRING_FIELD(strategy);
> > > >      COPY_NODE_FIELD(partParams);
> > > > +    COPY_SCALAR_FIELD(partnparts);
> > > > +    COPY_NODE_FIELD(hashfunc);
> > > >      COPY_LOCATION_FIELD(location);
> > > >  
> > > >      return newnode;
> > > > diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
> > > > index d595cd7..d589eac 100644
> > > > --- a/src/backend/nodes/equalfuncs.c
> > > > +++ b/src/backend/nodes/equalfuncs.c
> > > > @@ -2725,6 +2725,8 @@ _equalPartitionSpec(const PartitionSpec *a, const PartitionSpec *b)
> > > >  {
> > > >      COMPARE_STRING_FIELD(strategy);
> > > >      COMPARE_NODE_FIELD(partParams);
> > > > +    COMPARE_SCALAR_FIELD(partnparts);
> > > > +    COMPARE_NODE_FIELD(hashfunc);
> > > >      COMPARE_LOCATION_FIELD(location);
> > > >  
> > > >      return true;
> > > > diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
> > > > index b3802b4..d6db80e 100644
> > > > --- a/src/backend/nodes/outfuncs.c
> > > > +++ b/src/backend/nodes/outfuncs.c
> > > > @@ -3318,6 +3318,8 @@ _outPartitionSpec(StringInfo str, const PartitionSpec *node)
> > > >  
> > > >      WRITE_STRING_FIELD(strategy);
> > > >      WRITE_NODE_FIELD(partParams);
> > > > +    WRITE_INT_FIELD(partnparts);
> > > > +    WRITE_NODE_FIELD(hashfunc);
> > > >      WRITE_LOCATION_FIELD(location);
> > > >  }
> > > >  
> > > > diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> > > > index e833b2e..b67140d 100644
> > > > --- a/src/backend/parser/gram.y
> > > > +++ b/src/backend/parser/gram.y
> > > > @@ -574,6 +574,8 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > > >  %type <list>        partbound_datum_list
> > > >  %type <partrange_datum>    PartitionRangeDatum
> > > >  %type <list>        range_datum_list
> > > > +%type <ival>        hash_partitions
> > > > +%type <list>        hash_function
> > > >  
> > > >  /*
> > > >   * Non-keyword token types.  These are hard-wired into the "flex" lexer.
> > > > @@ -627,7 +629,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > > >  
> > > >      GLOBAL GRANT GRANTED GREATEST GROUP_P GROUPING
> > > >  
> > > > -    HANDLER HAVING HEADER_P HOLD HOUR_P
> > > > +    HANDLER HASH HAVING HEADER_P HOLD HOUR_P
> > > >  
> > > >      IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
> > > >      INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
> > > > @@ -651,7 +653,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
> > > >      OBJECT_P OF OFF OFFSET OIDS OLD ON ONLY OPERATOR OPTION OPTIONS OR
> > > >      ORDER ORDINALITY OUT_P OUTER_P OVER OVERLAPS OVERLAY OWNED OWNER
> > > >  
> > > > -    PARALLEL PARSER PARTIAL PARTITION PASSING PASSWORD PLACING PLANS POLICY
> > > > +    PARALLEL PARSER PARTIAL PARTITION PARTITIONS PASSING PASSWORD PLACING PLANS POLICY
> > > >      POSITION PRECEDING PRECISION PRESERVE PREPARE PREPARED PRIMARY
> > > >      PRIOR PRIVILEGES PROCEDURAL PROCEDURE PROGRAM PUBLICATION
> > > >  
> > > > @@ -2587,6 +2589,16 @@ ForValues:
> > > >  
> > > >                      $$ = (Node *) n;
> > > >                  }
> > > > +
> > > > +            /* a HASH partition */
> > > > +            | /*EMPTY*/
> > > > +                {
> > > > +                    PartitionBoundSpec *n = makeNode(PartitionBoundSpec);
> > > > +
> > > > +                    n->strategy = PARTITION_STRATEGY_HASH;
> > > > +
> > > > +                    $$ = (Node *) n;
> > > > +                }
> > > >          ;
> > > >  
> > > >  partbound_datum:
> > > > @@ -3666,7 +3678,7 @@ OptPartitionSpec: PartitionSpec    { $$ = $1; }
> > > >              | /*EMPTY*/            { $$ = NULL; }
> > > >          ;
> > > >  
> > > > -PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> > > > +PartitionSpec: PARTITION BY part_strategy '(' part_params ')' hash_partitions hash_function
> > > >                  {
> > > >                      PartitionSpec *n = makeNode(PartitionSpec);
> > > >  
> > > > @@ -3674,10 +3686,21 @@ PartitionSpec: PARTITION BY part_strategy '(' part_params ')'
> > > >                      n->partParams = $5;
> > > >                      n->location = @1;
> > > >  
> > > > +                    n->partnparts = $7;
> > > > +                    n->hashfunc = $8;
> > > > +
> > > >                      $$ = n;
> > > >                  }
> > > >          ;
> > > >  
> > > > +hash_partitions: PARTITIONS Iconst { $$ = $2; }
> > > > +                    | /*EMPTY*/   { $$ = -1; }
> > > > +        ;
> > > > +
> > > > +hash_function: USING handler_name { $$ = $2; }
> > > > +                    | /*EMPTY*/ { $$ = NULL; }
> > > > +        ;
> > > > +
> > > >  part_strategy:    IDENT                    { $$ = $1; }
> > > >                  | unreserved_keyword    { $$ = pstrdup($1); }
> > > >          ;
> > > > @@ -14377,6 +14400,7 @@ unreserved_keyword:
> > > >              | GLOBAL
> > > >              | GRANTED
> > > >              | HANDLER
> > > > +            | HASH
> > > >              | HEADER_P
> > > >              | HOLD
> > > >              | HOUR_P
> > > > @@ -14448,6 +14472,7 @@ unreserved_keyword:
> > > >              | PARSER
> > > >              | PARTIAL
> > > >              | PARTITION
> > > > +            | PARTITIONS
> > > >              | PASSING
> > > >              | PASSWORD
> > > >              | PLANS
> > > > diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
> > > > index ff2bab6..8e1be31 100644
> > > > --- a/src/backend/parser/parse_utilcmd.c
> > > > +++ b/src/backend/parser/parse_utilcmd.c
> > > > @@ -40,6 +40,7 @@
> > > >  #include "catalog/pg_opclass.h"
> > > >  #include "catalog/pg_operator.h"
> > > >  #include "catalog/pg_type.h"
> > > > +#include "catalog/partition.h"
> > > >  #include "commands/comment.h"
> > > >  #include "commands/defrem.h"
> > > >  #include "commands/tablecmds.h"
> > > > @@ -3252,6 +3253,24 @@ transformPartitionBound(ParseState *pstate, Relation parent, Node *bound)
> > > >              ++i;
> > > >          }
> > > >      }
> > > > +    else if (strategy == PARTITION_STRATEGY_HASH)
> > > > +    {
> > > > +        Value     *conval;
> > > > +        Node        *value;
> > > > +        int          index;
> > > > +
> > > > +        if (spec->strategy != PARTITION_STRATEGY_HASH)
> > > > +            ereport(ERROR,
> > > > +                    (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> > > > +                 errmsg("invalid bound specification for a hash partition")));
> > > > +
> > > > +        index = get_next_hash_partition_index(parent);
> > > > +
> > > > +        /* store the partition index as a listdatums value */
> > > > +        conval = makeInteger(index);
> > > > +        value = (Node *) make_const(pstate, conval, -1);
> > > > +        result_spec->listdatums = list_make1(value);
> > > > +    }
> > > >      else
> > > >          elog(ERROR, "unexpected partition strategy: %d", (int) strategy);
> > > >  
> > > > diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
> > > > index b27b77d..fab6eea 100644
> > > > --- a/src/backend/utils/adt/ruleutils.c
> > > > +++ b/src/backend/utils/adt/ruleutils.c
> > > > @@ -1423,7 +1423,7 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
> > > >   *
> > > >   * Returns the partition key specification, ie, the following:
> > > >   *
> > > > - * PARTITION BY { RANGE | LIST } (column opt_collation opt_opclass [, ...])
> > > > + * PARTITION BY { RANGE | LIST | HASH } (column opt_collation opt_opclass [, ...])
> > > >   */
> > > >  Datum
> > > >  pg_get_partkeydef(PG_FUNCTION_ARGS)
> > > > @@ -1513,6 +1513,9 @@ pg_get_partkeydef_worker(Oid relid, int prettyFlags)
> > > >          case PARTITION_STRATEGY_RANGE:
> > > >              appendStringInfo(&buf, "RANGE");
> > > >              break;
> > > > +        case PARTITION_STRATEGY_HASH:
> > > > +            appendStringInfo(&buf, "HASH");
> > > > +            break;
> > > >          default:
> > > >              elog(ERROR, "unexpected partition strategy: %d",
> > > >                   (int) form->partstrat);
> > > > @@ -8520,6 +8523,9 @@ get_rule_expr(Node *node, deparse_context *context,
> > > >                          appendStringInfoString(buf, ")");
> > > >                          break;
> > > >  
> > > > +                    case PARTITION_STRATEGY_HASH:
> > > > +                        break;
> > > > +
> > > >                      default:
> > > >                          elog(ERROR, "unrecognized partition strategy: %d",
> > > >                               (int) spec->strategy);
> > > > diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
> > > > index 9001e20..829e4d2 100644
> > > > --- a/src/backend/utils/cache/relcache.c
> > > > +++ b/src/backend/utils/cache/relcache.c
> > > > @@ -855,6 +855,9 @@ RelationBuildPartitionKey(Relation relation)
> > > >      key->strategy = form->partstrat;
> > > >      key->partnatts = form->partnatts;
> > > >  
> > > > +    key->partnparts = form->partnparts;
> > > > +    key->parthashfunc = form->parthashfunc;
> > > > +
> > > >      /*
> > > >       * We can rely on the first variable-length attribute being mapped to the
> > > >       * relevant field of the catalog's C struct, because all previous
> > > > @@ -999,6 +1002,9 @@ copy_partition_key(PartitionKey fromkey)
> > > >      newkey->strategy = fromkey->strategy;
> > > >      newkey->partnatts = n = fromkey->partnatts;
> > > >  
> > > > +    newkey->partnparts = fromkey->partnparts;
> > > > +    newkey->parthashfunc = fromkey->parthashfunc;
> > > > +
> > > >      newkey->partattrs = (AttrNumber *) palloc(n * sizeof(AttrNumber));
> > > >      memcpy(newkey->partattrs, fromkey->partattrs, n * sizeof(AttrNumber));
> > > >  
> > > > diff --git a/src/include/catalog/heap.h b/src/include/catalog/heap.h
> > > > index 1187797..367e2f8 100644
> > > > --- a/src/include/catalog/heap.h
> > > > +++ b/src/include/catalog/heap.h
> > > > @@ -141,7 +141,7 @@ extern void StorePartitionKey(Relation rel,
> > > >                    AttrNumber *partattrs,
> > > >                    List *partexprs,
> > > >                    Oid *partopclass,
> > > > -                  Oid *partcollation);
> > > > +                  Oid *partcollation, int16 partnparts, Oid hashfunc);
> > > >  extern void RemovePartitionKeyByRelId(Oid relid);
> > > >  extern void StorePartitionBound(Relation rel, Relation parent, Node *bound);
> > > >  
> > > > diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
> > > > index b195d1a..80f4b0e 100644
> > > > --- a/src/include/catalog/partition.h
> > > > +++ b/src/include/catalog/partition.h
> > > > @@ -89,4 +89,6 @@ extern int get_partition_for_tuple(PartitionDispatch *pd,
> > > >                          TupleTableSlot *slot,
> > > >                          EState *estate,
> > > >                          Oid *failed_at);
> > > > +extern Expr *convert_expr_for_hash(Expr *expr, int npart, Oid hashfunc);
> > > > +extern int get_next_hash_partition_index(Relation parent);
> > > >  #endif   /* PARTITION_H */
> > > > diff --git a/src/include/catalog/pg_partitioned_table.h b/src/include/catalog/pg_partitioned_table.h
> > > > index bdff36a..69e509c 100644
> > > > --- a/src/include/catalog/pg_partitioned_table.h
> > > > +++ b/src/include/catalog/pg_partitioned_table.h
> > > > @@ -33,6 +33,9 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> > > >      char        partstrat;        /* partitioning strategy */
> > > >      int16        partnatts;        /* number of partition key columns */
> > > >  
> > > > +    int16        partnparts;
> > > > +    Oid            parthashfunc;
> > > > +
> > > >      /*
> > > >       * variable-length fields start here, but we allow direct access to
> > > >       * partattrs via the C struct.  That's because the first variable-length
> > > > @@ -49,6 +52,8 @@ CATALOG(pg_partitioned_table,3350) BKI_WITHOUT_OIDS
> > > >      pg_node_tree partexprs;        /* list of expressions in the partition key;
> > > >                                   * one item for each zero entry in partattrs[] */
> > > >  #endif
> > > > +
> > > > +
> > > >  } FormData_pg_partitioned_table;
> > > >  
> > > >  /* ----------------
> > > > @@ -62,13 +67,15 @@ typedef FormData_pg_partitioned_table *Form_pg_partitioned_table;
> > > >   *        compiler constants for pg_partitioned_table
> > > >   * ----------------
> > > >   */
> > > > -#define Natts_pg_partitioned_table                7
> > > > +#define Natts_pg_partitioned_table                9
> > > >  #define Anum_pg_partitioned_table_partrelid        1
> > > >  #define Anum_pg_partitioned_table_partstrat        2
> > > >  #define Anum_pg_partitioned_table_partnatts        3
> > > > -#define Anum_pg_partitioned_table_partattrs        4
> > > > -#define Anum_pg_partitioned_table_partclass        5
> > > > -#define Anum_pg_partitioned_table_partcollation 6
> > > > -#define Anum_pg_partitioned_table_partexprs        7
> > > > +#define Anum_pg_partitioned_table_partnparts    4
> > > > +#define Anum_pg_partitioned_table_parthashfunc    5
> > > > +#define Anum_pg_partitioned_table_partattrs        6
> > > > +#define Anum_pg_partitioned_table_partclass        7
> > > > +#define Anum_pg_partitioned_table_partcollation 8
> > > > +#define Anum_pg_partitioned_table_partexprs        9
> > > >  
> > > >  #endif   /* PG_PARTITIONED_TABLE_H */
> > > > diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
> > > > index 5afc3eb..1c3474f 100644
> > > > --- a/src/include/nodes/parsenodes.h
> > > > +++ b/src/include/nodes/parsenodes.h
> > > > @@ -730,11 +730,14 @@ typedef struct PartitionSpec
> > > >      NodeTag        type;
> > > >      char       *strategy;        /* partitioning strategy ('list' or 'range') */
> > > >      List       *partParams;        /* List of PartitionElems */
> > > > +    int            partnparts;
> > > > +    List       *hashfunc;
> > > >      int            location;        /* token location, or -1 if unknown */
> > > >  } PartitionSpec;
> > > >  
> > > >  #define PARTITION_STRATEGY_LIST        'l'
> > > >  #define PARTITION_STRATEGY_RANGE    'r'
> > > > +#define PARTITION_STRATEGY_HASH        'h'
> > > >  
> > > >  /*
> > > >   * PartitionBoundSpec - a partition bound specification
> > > > diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
> > > > index 985d650..0597939 100644
> > > > --- a/src/include/parser/kwlist.h
> > > > +++ b/src/include/parser/kwlist.h
> > > > @@ -180,6 +180,7 @@ PG_KEYWORD("greatest", GREATEST, COL_NAME_KEYWORD)
> > > >  PG_KEYWORD("group", GROUP_P, RESERVED_KEYWORD)
> > > >  PG_KEYWORD("grouping", GROUPING, COL_NAME_KEYWORD)
> > > >  PG_KEYWORD("handler", HANDLER, UNRESERVED_KEYWORD)
> > > > +PG_KEYWORD("hash", HASH, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("having", HAVING, RESERVED_KEYWORD)
> > > >  PG_KEYWORD("header", HEADER_P, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("hold", HOLD, UNRESERVED_KEYWORD)
> > > > @@ -291,6 +292,7 @@ PG_KEYWORD("parallel", PARALLEL, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("parser", PARSER, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("partial", PARTIAL, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("partition", PARTITION, UNRESERVED_KEYWORD)
> > > > +PG_KEYWORD("partitions", PARTITIONS, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("passing", PASSING, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("password", PASSWORD, UNRESERVED_KEYWORD)
> > > >  PG_KEYWORD("placing", PLACING, RESERVED_KEYWORD)
> > > > diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> > > > index a617a7c..660adfb 100644
> > > > --- a/src/include/utils/rel.h
> > > > +++ b/src/include/utils/rel.h
> > > > @@ -62,6 +62,9 @@ typedef struct PartitionKeyData
> > > >      Oid           *partopcintype;    /* OIDs of opclass declared input data types */
> > > >      FmgrInfo   *partsupfunc;    /* lookup info for support funcs */
> > > >  
> > > > +    int16        partnparts;        /* number of hash partitions */
> > > > +    Oid            parthashfunc;    /* OID of hash function */
> > > > +
> > > >      /* Partitioning collation per attribute */
> > > >      Oid           *partcollation;
> > > >  
> > > 
> > > 
> > > 
> > > -- 
> > > Best regards,
> > > Aleksander Alekseev
> > 
> > 
> > -- 
> > Yugo Nagata <nagata@sraoss.co.jp>
> 
> -- 
> Best regards,
> Aleksander Alekseev


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

02 March 2017, 16:03:42

On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:

> [....]
>
> > I Agree that it is unavoidable partitions number in modulo hashing,
> > but we can do in other hashing technique. Have you had thought about
> > Linear hashing[1] or Consistent hashing[2]? This will allow us to
> > add/drop partition with minimal row moment.
>
> Thank you for your information of hash technique. I'll see them
> and try to allowing the number of partitions to be changed.

Thanks for showing interest. I was also talking about this with Robert Haas and
hacking on it; here is what we came up with.

If we want to introduce hash partitioning without contorting the syntax and with
minimal row movement when changing hash partitions (ADD/DROP or ATTACH/DETACH
operations), at first I thought we could pick linear hashing, because with
either hashing technique we might need to move approximately
tot_num_of_tuple/tot_num_of_partitions tuples when adding a new partition, and
no row movement is required when dropping a partition.

After further thinking and talking through the idea of using linear hashing
with my team, we realized that it has some problems, especially during pg_dump
and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of
pg_dump, which is used by pg_upgrade, need to maintain the identity of the
partitions. We can't rely on things like OID order, which may be unstable, or
name order, which might not match the order in which partitions were added. So
somehow the partition position would need to be specified explicitly.

So later we came up with some syntax like this (just FYI, this doesn't add
any new keywords):

create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 4, remainder 0);
create table foo2 partition of foo with (modulus 8, remainder 1);  -- legal, modulus doesn't need to match
create table foo3 partition of foo with (modulus 8, remainder 4);  -- illegal, overlaps foo1

Here we need to enforce a rule that every modulus must be a factor of the next
larger modulus. So, for example, if you have a bunch of partitions that all have
modulus 5, you can add a new partition with modulus 10 or a new partition with
modulus 15, but you cannot add both a partition with modulus 10 and a partition
with modulus 15, because 10 is not a factor of 15. However, you could
simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you
wished, because each modulus is a factor of the next larger one. You could
also use modulus 10, modulus 20, and modulus 60. But you could not use modulus
10, modulus 15, and modulus 60, because while both of the smaller moduli are
factors of 60, it is not true that each is a factor of the next.
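
(Illustration only, not taken from the attached patch.) The rule above can be
sketched in a few lines of Python. The function name is made up for this
example, and the sketch assumes the moduli are positive integers:

```python
def moduli_are_compatible(moduli):
    """Return True if every distinct modulus is a factor of the next larger one."""
    distinct = sorted(set(moduli))
    # Compare each modulus with its immediate successor in ascending order.
    return all(larger % smaller == 0
               for smaller, larger in zip(distinct, distinct[1:]))


# The cases from the paragraph above.
print(moduli_are_compatible([5, 5, 10]))      # modulus 5 plus one modulus 10
print(moduli_are_compatible([5, 10, 15]))     # 10 is not a factor of 15
print(moduli_are_compatible([4, 8, 16, 32]))
print(moduli_are_compatible([10, 15, 60]))    # both divide 60, but 10 does not divide 15
```

Checking only neighbouring pairs is enough, because divisibility is
transitive: if each modulus divides the next, it also divides every larger one.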

Other advantages of this rule are:

1. Dropping (or detaching) and adding (or attaching) a partition can never
cause the rule to be violated.

2. We can easily build a tuple-routing data structure based on the largest
modulus.

For example, if the user has

partition 1 with (modulus 2, remainder 1),
partition 2 with (modulus 4, remainder 2),
partition 3 with (modulus 8, remainder 0) and
partition 4 with (modulus 8, remainder 4),

then we can build the following tuple routing array in the relcache:

== lookup table for hashvalue % 8 ==
0 => p3
1 => p1
2 => p2
3 => p1
4 => p4
5 => p1
6 => p2
7 => p1

3. It's also quite easy to test whether a proposed new partition overlaps with
any existing partition. Just build the mapping array and see if you ever end up
trying to assign a partition to a slot that's already been assigned to some
other partition.
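
(Illustration only, not the relcache code from the patch.) Points 2 and 3 can
be sketched together in Python: building the routing array and detecting an
overlap are the same loop. The function name and the tuple layout are
assumptions made for this example, and the sketch presumes the moduli already
satisfy the factor rule:

```python
def build_routing_array(partitions):
    """Map each value of hashvalue % largest_modulus to a partition name.

    `partitions` is a list of (name, modulus, remainder) tuples.
    Raises ValueError if two partitions claim the same slot (an overlap).
    Slots that no partition covers are left as None.
    """
    largest = max(modulus for _, modulus, _ in partitions)
    slots = [None] * largest
    for name, modulus, remainder in partitions:
        # A partition (modulus m, remainder r) owns slots r, r+m, r+2m, ...
        for slot in range(remainder, largest, modulus):
            if slots[slot] is not None:
                raise ValueError(f"{name} overlaps {slots[slot]} at slot {slot}")
            slots[slot] = name
    return slots


routing = build_routing_array([
    ("p1", 2, 1), ("p2", 4, 2), ("p3", 8, 0), ("p4", 8, 4),
])
for slot, name in enumerate(routing):
    print(f"{slot} => {name}")
```

Feeding it the earlier example (foo1 with modulus 4, remainder 0 and foo3 with
modulus 8, remainder 4) raises the error, because foo1 already owns slot 4.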

We can still work on the proposed syntax, and I am open to suggestions. One
more thought is to use FOR VALUES HAVING, like:

CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder 1);

But more thoughts/inputs are still welcome here.

The attached patch implements the former syntax; here is a quick demonstration:

1. CREATE:

create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo with (modulus 2, remainder 1);
create table foo2 partition of foo with (modulus 4, remainder 2);
create table foo3 partition of foo with (modulus 8, remainder 0);
create table foo4 partition of foo with (modulus 8, remainder 4);

2. Display parent table info:

postgres=# \d+ foo
                                    Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
 a      | integer |           |          |         | plain    |              |
 b      | text    |           |          |         | extended |              |
Partition key: HASH (a)
Partitions: foo1 WITH (modulus 2, remainder 1),
            foo2 WITH (modulus 4, remainder 2),
            foo3 WITH (modulus 8, remainder 0),
            foo4 WITH (modulus 8, remainder 4)

3. Display child table info:

postgres=# \d+ foo1
                                    Table "public.foo1"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
 a      | integer |           |          |         | plain    |              |
 b      | text    |           |          |         | extended |              |
Partition of: foo WITH (modulus 2, remainder 1)

4. INSERT:

postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;
INSERT 0 10

postgres=# select tableoid::regclass as part, * from foo;
 part | a  |  b
------+----+-----
 foo1 |  3 | abc
 foo1 |  4 | abc
 foo1 |  7 | abc
 foo1 | 10 | abc
 foo2 |  1 | abc
 foo2 |  2 | abc
 foo2 |  9 | abc
 foo3 |  6 | abc
 foo4 |  5 | abc
 foo4 |  8 | abc
(10 rows)

TODOs:

1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.
2. Trim regression tests (if required).
3. Documentation.

Thoughts/Comments?

Attachment

hash-partitioning_another_design-v1.patch

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

03 March 2017, 14:01:57

On Thu, 2 Mar 2017 18:33:42 +0530
amul sul <sulamul@gmail.com> wrote:

Thank you for the patch. This is very interesting. I'm going to look
into your code and write up feedback later.

> On Wed, Mar 1, 2017 at 3:50 PM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>
> > [....]
> >
> > I Agree that it is unavoidable partitions number in modulo hashing,
> > > but we can do in other hashing technique.  Have you had thought about
> > > Linear hashing[1] or Consistent hashing[2]?  This will allow us to
> > > add/drop
> > > partition with minimal row moment. 
> >
> > Thank you for your information of hash technique. I'll see them
> > and try to allowing the number of partitions to be changed.
> >
> > 
> Thanks for showing interest, I was also talking about this with Robert Haas
> and
> hacking on this, here is what we came up with this.
>
> If we want to introduce hash partitioning without syntax contort and minimal
> movement while changing hash partitions (ADD-DROP/ATTACH-DETACH operation),
> at start I thought we could pick up linear hashing, because of in both the
> hashing we might need to move approx tot_num_of_tuple/tot_num_of_partitions
> tuples at adding new partition and no row moment required at dropping
> partitioning.
>
> With further thinking and talking through the idea of using linear hashing
> with my team, we realized that has some problems specially during pg_dump
> and pg_upgrade. Both a regular pg_dump and the binary-upgrade version of
> pg_dump which is used by pg_restore need to maintain the identity of the
> partitions. We can't rely on things like OID order which may be unstable, or
> name order which might not match the order in which partitions were added.
> So
> somehow the partition position would need to be specified explicitly.
>
> So later we came up with some syntax like this (just fyi, this doesn't add
> any new keywords):
>
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 4, remainder 0);
> create table foo2 partition of foo with (modulus 8, remainder 1);  --
> legal, modulus doesn't need to match
> create table foo3 partition of foo with (modulus 8, remainder 4);  --
> illegal, overlaps foo1
>
> Here we need to enforce a rule that every modulus must be a factor of the
> next
> larger modulus. So, for example, if you have a bunch of partitions that all
> have
> modulus 5, you can add a new partition with modulus 10 or a new partition
> with
> modulus 15, but you cannot add both a partition with modulus 10 and a
> partition
> with modulus 15, because 10 is not a factor of 15. However, you could
> simultaneously use modulus 4, modulus 8, modulus 16, and modulus 32 if you
> wished, because each modulus is a factor of the next larger one. You could
> also use modulus 10, modulus 20, and modulus 60. But you could not use
> modulus
> 10, modulus 15, and modulus 60, because while both of the smaller module are
> factors of 60, it is not true that each is a factor of the next.
>
> Other advantages with this rule are:
>
> 1. Dropping (or detaching) and adding (or attaching) a partition can never
> cause the rule to be violated.
>
> 2. We can easily build a tuple-routing data structure based on the largest
> modulus.
>
> For example: If the user has
> partition 1 with (modulus 2, remainder 1),
> partition 2 with (modulus 4, remainder 2),
> partition 3 with (modulus 8, remainder 0) and
> partition 4 with (modulus 8, remainder 4),
>
> then we can build the following tuple routing array in the relcache:
>
> == lookup table for hashvalue % 8 ==
> 0 => p3
> 1 => p1
> 2 => p2
> 3 => p1
> 4 => p4
> 5 => p1
> 6 => p2
> 7 => p1
>
> 3. It's also quite easy to test with a proposed new partition overlaps with
> any
> existing partition. Just build the mapping array and see if you ever end up
> trying to assign a partition to a slot that's already been assigned to some
> other partition.
>
> We can still work on the proposed syntax - and I am open for suggestions.
> One
> more thought is to use FOR VALUES HAVING like:
> CREATE TABLE foo1 PARTITION OF foo FOR VALUES HAVING (modulus 2, remainder
> 1);
>
> But still more thoughts/inputs welcome here.
>
> Attached patch implements former syntax, here is quick demonstration:
>
> 1.CREATE :
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 2, remainder 1);
> create table foo2 partition of foo with (modulus 4, remainder 2);
> create table foo3 partition of foo with (modulus 8, remainder 0);
> create table foo4 partition of foo with (modulus 8, remainder 4);
>
> 2. Display parent table info:
> postgres=# \d+ foo
>                                     Table "public.foo"
>  Column |  Type   | Collation | Nullable | Default | Storage  | Stats
> target | Description
> --------+---------+-----------+----------+---------+----------+--------------+-------------
>  a      | integer |           |          |         | plain    |
>  |
>  b      | text    |           |          |         | extended |
>  |
> Partition key: HASH (a)
> Partitions: foo1 WITH (modulus 2, remainder 1),
>             foo2 WITH (modulus 4, remainder 2),
>             foo3 WITH (modulus 8, remainder 0),
>             foo4 WITH (modulus 8, remainder 4)
>
> 3. Display child table info:
> postgres=# \d+ foo1
>                                     Table "public.foo1"
>  Column |  Type   | Collation | Nullable | Default | Storage  | Stats
> target | Description
> --------+---------+-----------+----------+---------+----------+--------------+-------------
>  a      | integer |           |          |         | plain    |
>  |
>  b      | text    |           |          |         | extended |
>  |
> Partition of: foo WITH (modulus 2, remainder 1)
>
> 4. INSERT:
> postgres=# insert into foo select i, 'abc' from generate_series(1,10) i;
> INSERT 0 10
>
> postgres=# select tableoid::regclass as part, * from foo;
>  part | a  |  b
> ------+----+-----
>  foo1 |  3 | abc
>  foo1 |  4 | abc
>  foo1 |  7 | abc
>  foo1 | 10 | abc
>  foo2 |  1 | abc
>  foo2 |  2 | abc
>  foo2 |  9 | abc
>  foo3 |  6 | abc
>  foo4 |  5 | abc
>  foo4 |  8 | abc
> (10 rows)
>
> TODOs.
> 1. Maybe need some work in the CREATE TABLE .. PARTITION OF .. syntax.
> 2. Trim regression tests (if require).
> 3. Documentation
>
> Thoughts/Comments?


--
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

Greg Stark

Date:

03 March 2017, 14:30:17

On 2 March 2017 at 13:03, amul sul <sulamul@gmail.com> wrote:
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 4, remainder 0);
> create table foo2 partition of foo with (modulus 8, remainder 1);  -- legal,
> modulus doesn't need to match
> create table foo3 partition of foo with (modulus 8, remainder 4);  --
> illegal, overlaps foo1

Instead of using modulus, why not just divide up the range of hash
keys using ranges? That should be just as good for a good hash
function (effectively using the high bits instead of the low bits of
the hash value). And it would mean you could reuse the machinery for
list partitioning for partition exclusion.

It also has the advantage that it's easier to see how to add more
partitions. You just split all the ranges (and migrate the
data...). There's even the possibility of having uneven partitions if
you have a data distribution skew -- which can happen even if you have
a good hash function. In a degenerate case you could have a partition
for a single hash of a particularly common value, then a reasonable
number of partitions for the remaining hash ranges.
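
(Illustration only.) A rough Python sketch of the difference between the two
schemes, assuming an unsigned 32-bit hash value and equal-width ranges. The
function names are invented for this example:

```python
HASH_BITS = 32  # assumption: hash values are unsigned 32-bit integers


def range_partition(hashvalue, nparts):
    """Index of the equal-width hash range containing hashvalue (uses the high bits)."""
    return (hashvalue * nparts) >> HASH_BITS


def modulus_partition(hashvalue, nparts):
    """Index chosen by remainder (uses the low bits when nparts is a power of two)."""
    return hashvalue % nparts


h = 0xC0000001
print(range_partition(h, 4), modulus_partition(h, 4))
# Splitting every range in two: a row in old range i lands in new range 2*i or 2*i + 1.
print(range_partition(h, 8) // 2 == range_partition(h, 4))
```

The last line is the "split all the ranges" property: doubling the partition
count only ever moves a row into one of the two halves of its old range.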

-- 
greg

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

03 March 2017, 16:33:22

On Fri, Mar 3, 2017 at 5:00 PM, Greg Stark <stark@mit.edu> wrote:

> On 2 March 2017 at 13:03, amul sul <sulamul@gmail.com> wrote:
> > create table foo (a integer, b text) partition by hash (a);
> > create table foo1 partition of foo with (modulus 4, remainder 0);
> > create table foo2 partition of foo with (modulus 8, remainder 1); -- legal,
> > modulus doesn't need to match
> > create table foo3 partition of foo with (modulus 8, remainder 4); --
> > illegal, overlaps foo1
>
> Instead of using modulus, why not just divide up the range of hash
> keys using ranges? That should be just as good for a good hash
> function (effectively using the high bits instead of the low bits of
> the hash value). And it would mean you could reuse the machinery for
> list partitioning for partition exclusion.
>
> It also has the advantage that it's easier to see how to add more
> partitions. You just split all the ranges and (and migrate the
> data...). There's even the possibility of having uneven partitions if
> you have a data distribution skew -- which can happen even if you have
> a good hash function. In a degenerate case you could have a partition
> for a single hash of a particularly common value then a reasonable
> number of partitions for the remaining hash ranges.

Initially we had a somewhat similar thought: to specify a range of hash values
for each partition, using the same half-open interval syntax we use in general:

create table foo (a integer, b text) partition by hash (a);
create table foo1 partition of foo for values from (0) to (1073741824);
create table foo2 partition of foo for values from (1073741824) to (-2147483648);
create table foo3 partition of foo for values from (-2147483648) to (-1073741824);
create table foo4 partition of foo for values from (-1073741824) to (0);

That's really nice for the system, but not so much for the users. The system can
now generate each partition constraint correctly immediately upon seeing the SQL
statement for the corresponding table, which is very desirable. However, users are
not likely to know that the magic numbers to distribute keys equally across four
partitions are 1073741824, -2147483648, and -1073741824. So it's pretty
user-unfriendly.
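
(Illustration only.) To show where those magic numbers come from, here is a
small Python sketch that splits the 32-bit hash space into equal ranges and
prints the bounds as signed 32-bit integers. The helper names are invented for
this example, and it assumes the partition count is a power of two so that the
ranges divide the space exactly:

```python
def to_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def hash_range_bounds(nparts):
    """Return (lower, upper) signed bounds that split the hash space into nparts ranges."""
    step = (1 << 32) // nparts
    return [(to_int32(i * step), to_int32((i + 1) * step)) for i in range(nparts)]


for lower, upper in hash_range_bounds(4):
    print(f"for values from ({lower}) to ({upper})")
```

For four partitions this yields the same bounds as the statements above,
including the wrap-around from 1073741824 to -2147483648.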

Regards,
Amul

Re: [HACKERS] [POC] hash partitioning

From

David Steele

Date:

14 March 2017, 17:08:14

On 3/3/17 8:33 AM, amul sul wrote:
> On Fri, Mar 3, 2017 at 5:00 PM, Greg Stark <stark@mit.edu> wrote:
> 
>     It also has the advantage that it's easier to see how to add more
>     partitions. You just split all the ranges and (and migrate the
>     data...). There's even the possibility of having uneven partitions if
>     you have a data distribution skew -- which can happen even if you have
>     a good hash function. In a degenerate case you could have a partition
>     for a single hash of a particularly common value then a reasonable
>     number of partitions for the remaining hash ranges.
> 
> Initially we had to have somewhat similar thought to make a range of hash
> values for each partition, using the same half-open interval syntax we use
> in general:
> 

<...>

> So it's pretty user-unfriendly.

This patch is marked as POC and after a read-through I agree that's
exactly what it is.  As such, I'm not sure it belongs in the last
commitfest.  Furthermore, there has not been any activity or a new patch
in a while and we are halfway through the CF.

Please post an explanation for the delay and a schedule for the new
patch.  If no patch or explanation is posted by 2017-03-17 AoE I will
mark this submission "Returned with Feedback".

-- 
-David
david@pgmasters.net

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

15 March 2017, 19:25:23

On Tue, Mar 14, 2017 at 10:08 AM, David Steele <david@pgmasters.net> wrote:
> This patch is marked as POC and after a read-through I agree that's
> exactly what it is.

Just out of curiosity, were you looking at Nagata-san's patch, or Amul's?

> As such, I'm not sure it belongs in the last
> commitfest.  Furthermore, there has not been any activity or a new patch
> in a while and we are halfway through the CF.
>
> Please post an explanation for the delay and a schedule for the new
> patch.  If no patch or explanation is posted by 2017-03-17 AoE I will
> mark this submission "Returned with Feedback".

Regrettably, I do think it's too late to squeeze hash partitioning
into v10, but I plan to try to get something committed for v11.  I was
heavily involved in the design of Amul's patch, and I think that
design solves several problems that would be an issue for us if we did
as Nagata-san is proposing.  For example, he proposed this:

 CREATE TABLE h1 PARTITION OF h;
 CREATE TABLE h2 PARTITION OF h;
 CREATE TABLE h3 PARTITION OF h;

That looks OK if you are thinking of typing this in interactively, but
if you're doing a pg_dump, maybe with --binary-upgrade, you don't want
the meaning of a series of nearly-identical SQL commands to depend on
the dump ordering.  You want it to be explicit in the SQL command
which partition is which, and Amul's patch solves that problem.  Also,
Nagata-san's proposal doesn't provide any way to increase the number
of partitions later, and Amul's approach gives you some options there.
I'm not sure those options are as good as we'd like them to be, and if
not then we may need to revise the approach, but I'm pretty sure
having no strategy at all for changing the partition count is not good
enough.
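
(Illustration only.) A toy Python model of the dump-ordering problem. It
assumes, hypothetically, that an implicit scheme numbers partitions in the
order the commands are executed; both function names are invented for this
example:

```python
def implicit_indexes(create_order):
    """Hypothetical implicit scheme: the k-th partition created gets hash index k."""
    return {name: index for index, name in enumerate(create_order)}


def explicit_bounds(statements):
    """Explicit scheme: every statement carries its own (modulus, remainder)."""
    return {name: (modulus, remainder) for name, modulus, remainder in statements}


original = implicit_indexes(["h1", "h2", "h3"])
restored = implicit_indexes(["h2", "h3", "h1"])  # same commands, different dump order
print(original == restored)  # the partitions no longer mean the same thing

statements = [("h1", 3, 0), ("h2", 3, 1), ("h3", 3, 2)]
print(explicit_bounds(statements) == explicit_bounds(list(reversed(statements))))
```

With explicit bounds the result does not depend on the order in which the
commands are replayed, which is the property a dump needs.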

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

David Steele

Date:

15 March 2017, 19:39:12

On 3/15/17 12:25 PM, Robert Haas wrote:
> On Tue, Mar 14, 2017 at 10:08 AM, David Steele <david@pgmasters.net> wrote:
>> This patch is marked as POC and after a read-through I agree that's
>> exactly what it is.
> 
> Just out of curiosity, were you looking at Nagata-san's patch, or Amul's?

Both - what I was looking for was some kind of reconciliation between
the two patches and I didn't find that.  It seemed from the thread that
Yugo intended to pull Amul's changes/idea into his patch.

>> As such, I'm not sure it belongs in the last
>> commitfest.  Furthermore, there has not been any activity or a new patch
>> in a while and we are halfway through the CF.
>>
>> Please post an explanation for the delay and a schedule for the new
>> patch.  If no patch or explanation is posted by 2017-03-17 AoE I will
>> mark this submission "Returned with Feedback".
> 
> Regrettably, I do think it's too late to squeeze hash partitioning
> into v10, but I plan to try to get something committed for v11.  

It would certainly be a nice feature to have.

> I was
> heavily involved in the design of Amul's patch, and I think that
> design solves several problems that would be an issue for us if we did
> as Nagata-san is proposing.  For example, he proposed this:
> 
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
> 
> That looks OK if you are thinking of typing this in interactively, but
> if you're doing a pg_dump, maybe with --binary-upgrade, you don't want
> the meaning of a series of nearly-identical SQL commands to depend on
> the dump ordering.  You want it to be explicit in the SQL command
> which partition is which, and Amul's patch solves that problem.

OK, it wasn't clear to me that this was the case because of the stated
user-unfriendliness.

>  Also,
> Nagata-san's proposal doesn't provide any way to increase the number
> of partitions later, and Amul's approach gives you some options there.
> I'm not sure those options are as good as we'd like them to be, and if
> not then we may need to revise the approach, but I'm pretty sure
> having no strategy at all for changing the partition count is not good
> enough.

Agreed.  Perhaps both types of syntax should be supported, one that is
friendly to users and one that is precise for dump tools and those who
care to get into the weeds.

-- 
-David
david@pgmasters.net

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

15 March 2017, 20:01:30

On Wed, Mar 15, 2017 at 12:39 PM, David Steele <david@pgmasters.net> wrote:
> Agreed.  Perhaps both types of syntax should be supported, one that is
> friendly to users and one that is precise for dump tools and those who
> care get in the weeds.

Eventually, sure.  For the first version, I want to skip the friendly
syntax and just add the necessary syntax.  That makes it easier to
make sure that pg_dump and everything are working the way you want.
Range and list partitioning could potentially grow convenience syntax
around partition creation, too, but that wasn't essential for the
first patch, so we cut it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

17 March 2017, 14:57:23

On Tue, 14 Mar 2017 10:08:14 -0400
David Steele <david@pgmasters.net> wrote:

> Please post an explanation for the delay and a schedule for the new
> patch.  If no patch or explanation is posted by 2017-03-17 AoE I will
> mark this submission "Returned with Feedback".

I am sorry for my late response. I did not have enough time because I was on
a business trip and busy with other work.

I agree that fixing the number of partitions is bad and a way
to increase or decrease partitions should be provided. I also think
using linear hashing would be good, as Amul mentioned, but I
have not implemented it in my patch yet.

I also understand that my design has a problem during pg_dump and
pg_upgrade, and that some information identifying the partition
is required that does not depend on the command order. However, I feel that
Amul's design is a bit complicated with the rule to specify modulus.

I think we can use a simpler syntax, for example, as below.

 CREATE TABLE h1 PARTITION OF h FOR (0);
 CREATE TABLE h2 PARTITION OF h FOR (1);
 CREATE TABLE h3 PARTITION OF h FOR (2);

If a user wants to use a more complicated partitioning rule, it can be defined
by specifying a user-defined hash function when creating the partitioned table.
If the hash function is omitted, we will be able to use the default hash
operator class, as in Amul's patch.

Attached is the updated patch addressing the comments from Aleksander and
Rushabh. The HASH keyword and unnecessary spaces are removed, and some comments
are added.

Thanks,

-- 
Yugo Nagata <nagata@sraoss.co.jp>

Attachment

hash_partition.patch.v2

Re: [POC] hash partitioning

From

Tatsuo Ishii

Date:

28 March 2017, 04:06:46

> Please post an explanation for the delay and a schedule for the new
> patch.  If no patch or explanation is posted by 2017-03-17 AoE I will
> mark this submission "Returned with Feedback".

Despite the fact that Yugo posted a new patch on 2017-03-17, this
item has been marked as "Returned with Feedback". I don't know why.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

13 April 2017, 23:40:29

On Fri, Mar 17, 2017 at 7:57 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I also understanded that my design has a problem during pg_dump and
> pg_upgrade, and that some information to identify the partition
> is required not depending the command order. However, I feel that
> Amul's design is a bit complicated with the rule to specify modulus.
>
> I think we can use simpler syntax, for example, as below.
>
>  CREATE TABLE h1 PARTITION OF h FOR (0);
>  CREATE TABLE h2 PARTITION OF h FOR (1);
>  CREATE TABLE h3 PARTITION OF h FOR (2);

I don't see how that can possibly work.  Until you see all the table
partitions, you don't know what the partitioning constraint for any
given partition should be, which seems to me to be a fatal problem.

I agree that Amul's syntax - really, I proposed it to him - is not the
simplest, but I think all the details needed to reconstruct the
partitioning constraint need to be explicit.  Otherwise, I'm pretty
sure things we're going to have lots of problems that we can't really
solve cleanly.  We can later invent convenience syntax that makes
common configurations easier to set up, but we should invent the
syntax that spells out all the details first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

14 April 2017, 11:23:00

On Thu, 13 Apr 2017 16:40:29 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Mar 17, 2017 at 7:57 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > I also understanded that my design has a problem during pg_dump and
> > pg_upgrade, and that some information to identify the partition
> > is required not depending the command order. However, I feel that
> > Amul's design is a bit complicated with the rule to specify modulus.
> >
> > I think we can use simpler syntax, for example, as below.
> >
> >  CREATE TABLE h1 PARTITION OF h FOR (0);
> >  CREATE TABLE h2 PARTITION OF h FOR (1);
> >  CREATE TABLE h3 PARTITION OF h FOR (2);
> 
> I don't see how that can possibly work.  Until you see all the table
> partitions, you don't know what the partitioning constraint for any
> given partition should be, which seems to me to be a fatal problem.

If a partition has an id, the partitioning constraint can be written as

 hash_func(hash_key) % N = id

where N is the number of partitions. Wouldn't that work?

> I agree that Amul's syntax - really, I proposed it to him - is not the
> simplest, but I think all the details needed to reconstruct the
> partitioning constraint need to be explicit.  Otherwise, I'm pretty
> sure things we're going to have lots of problems that we can't really
> solve cleanly.  We can later invent convenience syntax that makes
> common configurations easier to set up, but we should invent the
> syntax that spells out all the details first.

I have a question about Amul's syntax. After we create partitions
as follows,

 create table foo (a integer, b text) partition by hash (a);
 create table foo1 partition of foo with (modulus 2, remainder 0);
 create table foo2 partition of foo with (modulus 2, remainder 1);

we cannot create any additional partitions for the table.

Then, after inserting records into foo1 and foo2, how can we
increase the number of partitions?

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

14 April 2017, 16:05:14

On Fri, Apr 14, 2017 at 4:23 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> On Thu, 13 Apr 2017 16:40:29 -0400
> Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Mar 17, 2017 at 7:57 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>> > I also understanded that my design has a problem during pg_dump and
>> > pg_upgrade, and that some information to identify the partition
>> > is required not depending the command order. However, I feel that
>> > Amul's design is a bit complicated with the rule to specify modulus.
>> >
>> > I think we can use simpler syntax, for example, as below.
>> >
>> >  CREATE TABLE h1 PARTITION OF h FOR (0);
>> >  CREATE TABLE h2 PARTITION OF h FOR (1);
>> >  CREATE TABLE h3 PARTITION OF h FOR (2);
>>
>> I don't see how that can possibly work.  Until you see all the table
>> partitions, you don't know what the partitioning constraint for any
>> given partition should be, which seems to me to be a fatal problem.
>
> If a partition has an id, the partitioning constraint can be written as
>
>  hash_func(hash_key) % N = id
>
> wehre N is the number of paritions. Doesn't it work?

Only if you know the number of partitions.  But with your syntax,
after seeing only the first of the CREATE TABLE .. PARTITION OF
commands, what should the partition constraint be?  It depends on how
many more such commands appear later in the dump file, which you do
not know at that point.

>> I agree that Amul's syntax - really, I proposed it to him - is not the
>> simplest, but I think all the details needed to reconstruct the
>> partitioning constraint need to be explicit.  Otherwise, I'm pretty
>> sure things we're going to have lots of problems that we can't really
>> solve cleanly.  We can later invent convenience syntax that makes
>> common configurations easier to set up, but we should invent the
>> syntax that spells out all the details first.
>
> I have a question about Amul's syntax. After we create partitions
> as followings,
>
>  create table foo (a integer, b text) partition by hash (a);
>  create table foo1 partition of foo with (modulus 2, remainder 0);
>  create table foo2 partition of foo with (modulus 2, remainder 1);
>
> we cannot create any additional partitions for the partition.
>
> Then, after inserting records into foo1 and foo2, how we can
> increase the number of partitions?

You can detach foo1, create two new partitions with modulus 4 and
remainders 0 and 2, and move the data over from the old partition.

I realize that's not as automated as you might like, but it's no worse
than what is currently required for list and range partitioning when
you split a partition.  Someday we might build in tools to do that
kind of data migration automatically, but right now we have none.
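
The split described above works because of modular arithmetic: every
hash value h with h % 2 = 0 has h % 4 equal to either 0 or 2, so the
rows of the detached (modulus 2, remainder 0) partition divide cleanly
between the two new partitions. A minimal Python sketch of that
redistribution follows. The stand_in_hash function and the
split_partition helper are assumptions made for illustration only; they
are not PostgreSQL's hash function or part of either patch.

```python
def stand_in_hash(key: int) -> int:
    """Multiplicative stand-in hash; NOT the hash PostgreSQL uses."""
    return (key * 2654435761) % 2**32


def split_partition(rows, old_modulus, old_remainder):
    """Redistribute one partition's rows into two partitions with twice the modulus."""
    new_modulus = old_modulus * 2
    # If h % m == r, then h % (2 * m) is either r or r + m.
    targets = {old_remainder: [], old_remainder + old_modulus: []}
    for key in rows:
        h = stand_in_hash(key)
        assert h % old_modulus == old_remainder  # row really belonged to the old partition
        targets[h % new_modulus].append(key)
    return targets


# Rows that lived in foo1 (modulus 2, remainder 0).
foo1_rows = [k for k in range(1000) if stand_in_hash(k) % 2 == 0]

# New partitions keyed by remainder under modulus 4.
new_parts = split_partition(foo1_rows, old_modulus=2, old_remainder=0)
print({remainder: len(keys) for remainder, keys in new_parts.items()})
```

No row of foo2 (modulus 2, remainder 1) is touched by this, which is
why the two moduli can coexist during the migration.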

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

17 April 2017, 11:50:42

On Fri, 14 Apr 2017 09:05:14 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

> On Fri, Apr 14, 2017 at 4:23 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> > On Thu, 13 Apr 2017 16:40:29 -0400
> > Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Fri, Mar 17, 2017 at 7:57 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> >> > I also understanded that my design has a problem during pg_dump and
> >> > pg_upgrade, and that some information to identify the partition
> >> > is required not depending the command order. However, I feel that
> >> > Amul's design is a bit complicated with the rule to specify modulus.
> >> >
> >> > I think we can use simpler syntax, for example, as below.
> >> >
> >> >  CREATE TABLE h1 PARTITION OF h FOR (0);
> >> >  CREATE TABLE h2 PARTITION OF h FOR (1);
> >> >  CREATE TABLE h3 PARTITION OF h FOR (2);
> >>
> >> I don't see how that can possibly work.  Until you see all the table
> >> partitions, you don't know what the partitioning constraint for any
> >> given partition should be, which seems to me to be a fatal problem.
> >
> > If a partition has an id, the partitioning constraint can be written as
> >
> >  hash_func(hash_key) % N = id
> >
> > wehre N is the number of paritions. Doesn't it work?
> 
> Only if you know the number of partitions.  But with your syntax,
> after seeing only the first of the CREATE TABLE .. PARTITION OF
> commands, what should the partition constraint be?  It depends on how
> many more such commands appear later in the dump file, which you do
> not know at that point.

I thought that the partition constraint could be decided every
time a new partition is created or attached, and that records would
need to be relocated automatically when the partition configuration
changes. However, I have come to think that automatic relocation
might not be needed at this point.

> 
> >> I agree that Amul's syntax - really, I proposed it to him - is not the
> >> simplest, but I think all the details needed to reconstruct the
> >> partitioning constraint need to be explicit.  Otherwise, I'm pretty
> >> sure things we're going to have lots of problems that we can't really
> >> solve cleanly.  We can later invent convenience syntax that makes
> >> common configurations easier to set up, but we should invent the
> >> syntax that spells out all the details first.
> >
> > I have a question about Amul's syntax. After we create partitions
> > as followings,
> >
> >  create table foo (a integer, b text) partition by hash (a);
> >  create table foo1 partition of foo with (modulus 2, remainder 0);
> >  create table foo2 partition of foo with (modulus 2, remainder 1);
> >
> > we cannot create any additional partitions for the partition.
> >
> > Then, after inserting records into foo1 and foo2, how we can
> > increase the number of partitions?
> 
> You can detach foo1, create two new partitions with modulus 4 and
> remainders 0 and 2, and move the data over from the old partition.
> 
> I realize that's not as automated as you might like, but it's no worse
> than what is currently required for list and range partitioning when
> you split a partition.  Someday we might build in tools to do that
> kind of data migration automatically, but right now we have none.

Thanks, I understand. It would be better to implement the automatic
data migration feature separately.

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

20 April 2017, 23:27:57

On Mon, Apr 17, 2017 at 10:50 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I thought that the partition constraint could be decided every
> time a new partition is created or attached, and that it woule be
> needed to relocate records automatically when the partition configuration
> changes. However, I have come to think that the automatic relocation
> might not be needed at this point.

Great!  I am glad that we are in agreement about this point.  However,
actually I think the problem is worse than you are supposing.  If
you're restoring from a database dump created by pg_dump, then we will
try to load data into each individual partition using COPY.  Direct
insertions into individual partitions are not subject to tuple routing
-- that only affects inserts into the parent table.  So if the
partition constraint is not correct immediately after creating the
table, the COPY which tries to repopulate that partition will probably
fail with an ERROR, because there will likely be at least one row
(probably many) which match the "final" partition constraint but not
the "interim" partition constraint that we'd have after recreating
some but not all of the hash partitions.  For example, if we had
created 2 partitions so far out of a total of 3, we'd think the
constraint ought to be (hashvalue % 2) == 1 rather than (hashvalue %
3) == 1, which obviously will likely lead to the dump failing to
restore properly.
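
The restore failure described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions:
stand_in_hash is a made-up hash (not PostgreSQL's), and the violates
helper simply models a check of the constraint hash % N = id against
rows loaded directly into one partition.

```python
def stand_in_hash(key: int) -> int:
    """Multiplicative stand-in hash; NOT the hash PostgreSQL uses."""
    return (key * 2654435761) % 2**32


FINAL_PARTITIONS = 3

# Rows dumped from partition 1 of the original 3-partition table.
dumped_rows = [k for k in range(100) if stand_in_hash(k) % FINAL_PARTITIONS == 1]


def violates(rows, n_partitions, partition_id):
    """Rows that fail the constraint hash % n_partitions == partition_id."""
    return [k for k in rows if stand_in_hash(k) % n_partitions != partition_id]


# Only 2 of the 3 partitions exist so far, so the interim constraint uses N = 2.
rejected_interim = violates(dumped_rows, 2, 1)
# Once all 3 partitions exist, the constraint uses N = 3.
rejected_final = violates(dumped_rows, 3, 1)

print(len(dumped_rows), len(rejected_interim), len(rejected_final))
```

Every dumped row satisfies the final constraint, but some of them fail
the interim one, so a COPY into the partition at that point would error
out.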

So, I think we really need something like the syntax in Amul's patch
in order for this to work at all.  Of course, the details can be
changed according to what seems best but I think the overall picture
is about right.

There is another point that I think also needs thought; not sure if
either your patch or Amul's patch handles it: constraint exclusion
will not work for hash partitioning.  For example, if the partitioning
constraint for each partition is of the form (hash(partcol) % 6) ==
SOME_VALUE_BETWEEN_0_AND_5, and the query contains the predicate
partcol == 37, constraint exclusion will not be able to prove anything
about which partitions need to be scanned.  Amit Langote has noted a
few times that partitioning relies on constraint exclusion *for now*,
which implies, I think, that he's thought about changing it to work
differently.  I think that would be a good idea.  For range
partitioning or list partitioning, a special-purpose mechanism for
partitioning could be much faster than constraint exclusion, since it
knows that partcol == 37 can only be true for one partition and can
reuse the tuple-routing infrastructure to figure out which one it is.
And that approach can also work for hash partitioning, where
constraint exclusion is useless.
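
As a rough illustration of the special-purpose mechanism suggested
above, the following Python sketch reuses the routing rule to decide
which partition an equality predicate can match. The names
stand_in_hash, partition_for, and partitions_to_scan are hypothetical
and the hash is a stand-in; nothing here reflects actual planner code.

```python
def stand_in_hash(key: int) -> int:
    """Multiplicative stand-in hash; NOT the hash PostgreSQL uses."""
    return (key * 2654435761) % 2**32


def partition_for(key: int, modulus: int) -> int:
    """Tuple-routing rule: the remainder of the one partition that accepts this key."""
    return stand_in_hash(key) % modulus


def partitions_to_scan(equality_key: int, modulus: int) -> list:
    """For a predicate partcol = equality_key, only one partition can hold matches."""
    return [partition_for(equality_key, modulus)]


MODULUS = 6

# Populate six partitions through the routing rule.
table = {remainder: [] for remainder in range(MODULUS)}
for key in range(200):
    table[partition_for(key, MODULUS)].append(key)

# Plan a scan for partcol = 37 without proving anything from the constraints.
scan = partitions_to_scan(37, MODULUS)
print(scan)
```

The point of the sketch is that the answer comes from evaluating the
hash once, not from trying to refute each partition's constraint.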

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

26 April 2017, 23:12:21

On Thu, Apr 20, 2017 at 4:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, I think we really need something like the syntax in Amul's patch
> in order for this to work at all.  Of course, the details can be
> changed according to what seems best but I think the overall picture
> is about right.

I spent some time today looking at these patches.  It seems like there
is some more work still needed here to produce something committable
regardless of which way we go, but I am inclined to think that Amul's
patch is a better basis for work going forward than Nagata-san's
patch. Here are some general comments on the two patches:

- As noted above, the syntax implemented by Amul's patch allows us to
know the final partition constraint right away.  Nagata-san's proposed
syntax does not do that.  Also, Amul's syntax allows for a way to
split partitions (awkwardly, but we can improve it later);
Nagata-san's doesn't provide any method at all.

- Amul's patch derives the hash function to be used from the relevant
hash opclass, whereas Nagata-san's patch requires the user to specify
it explicitly.  I think that there is no real use case for a user
providing a custom hash function, and that using the opclass machinery
to derive the function to be used is better.  If a user DOES want to
provide their own, they can always create a custom opclass with the
appropriate support function and specify that it should be used when
creating a hash-partitioned table, but most users will be happy for
the system to supply the appropriate function automatically.

- In Nagata-san's patch, convert_expr_for_hash() looks up a function
called "abs" and an operator called "%" by name, which is not a good
idea.  We don't want to just find whatever is in the current search
path; we want to make sure we're using the system-defined operators
that we intend to be using.  Amul's patch builds the constraint using
a hard-coded internal function OID, F_SATISFIES_HASH_PARTITION.
That's a lot more robust, and it's also likely to be faster because,
in Amul's patch, we only call one function at the SQL level
(satisfies_hash_partition), whereas in Nagata-san's patch, we'll end
up calling three (abs, %, =).  Nagata-san's version of
get_qual_for_hash is implicated in this problem, too: it's looking up
the operator to use based on the operator name (=) rather than the
opclass properties.  Note that the existing get_qual_for_list() and
get_qual_for_range() use opclass properties, as does Amul's patch.

- Nagata-san's patch only supports hash partitioning based on a single
column, and that column must be NOT NULL.  Amul's patch does not have
these restrictions.

- Neither patch contains any documentation updates, which is bad.
Nagata-san's patch also contains no regression tests.  Amul's patch
does, but they need to be rebased, since they no longer apply, and I
think some other improvements are possible as well.  It's probably not
necessary to re-test things like whether temp and non-temp tables can
be mixed within a partitioning hierarchy, but there should be tests
that tuple routing actually works.  The case where it fails because no
matching partition exists should be tested as well.  Also, the tests
should validate not only that FOR VALUES isn't accepted when creating a
hash partition (which they do) but also that WITH (...) isn't accepted
for a range or list partition (which they do not).

- When I try to do even something pretty trivial with Nagata-san's
patches, it crashes:

rhaas=# create table foo (a int, b text) partition by hash (a)
partitions 7 using hashint4;
CREATE TABLE
rhaas=# create table foo1 partition of foo;
<server crash>

The ruleutils.c support in Nagata-san's patch is broken.  If you
execute the non-crashing statement from the above example and then run
pg_dump, it doesn't dump "partitions 7 using hashint4", which means
that the syntax in the dump is invalid.

- Neither patch does anything about the fact that constraint exclusion
won't work for hash partitioning.  I mentioned this issue upthread in
the last paragraph of
http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
and I think it's imperative that we fix it in some way before we think
about committing any of this.  I think that needs to be done by
extending relation_excluded_by_constraints() to have some specific
smarts about hash partitioning, and maybe other kinds of partitioning
as well (because it could probably be made much faster for list and
range partitioning, too).

- Amul's patch should perhaps update tab completion support:  create
table foo1 partition of foo <tab> completes with "for values", but now
"with" will be another option.

- Amul's patch probably needs to validate the WITH () clause more
thoroughly.  I bet you get a not-very-great error message if you leave
out "modulus" and no error at all if you leave out "remainder".

This is not yet a detailed review - I may be missing things, and
review and commentary from others is welcome.  If there is no major
disagreement with the idea of moving forward using Amul's patch as a
base, then I will do a more detailed review of that patch (or,
hopefully, an updated version that addresses the above comments).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Jeff Davis

Date:

03 May 2017, 04:01:31

On Tue, Feb 28, 2017 at 6:33 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> In this patch, user can specify a hash function USING. However,
> we migth need default hash functions which are useful and
> proper for hash partitioning.

I suggest that we consider the hash functions more carefully. This is
(effectively) an on-disk format so it can't be changed easily later.

1. Consider a partition-wise join of two hash-partitioned tables. If
that's a hash join, and we just use the hash opclass, we immediately
lose some useful bits of the hash function. Same for hash aggregation
where the grouping key is the partition key.

To fix this, I think we need to include a salt in the hash API. Each
level of hashing can choose a random salt.

2. Consider a partition-wise join where the join keys are varchar(10)
and char(10). We can't do that join if we just use the existing hash
strategy, because 'foo' = 'foo       ' should match, but those values
have different hashes when using the standard hash opclass.

To fix this, we need to be smarter about normalizing values at a
logical level before hashing. We can take this to varying degrees,
perhaps even normalizing an integer to a numeric before hashing so
that you can do a cross-type join on int=numeric.

Furthermore, we need catalog metadata to indicate which hash functions
are suitable for which cross-type comparisons. Or, to put it the other
way, which typecasts preserve the partitioning.

3. We might want to use a hash function that is a little slower but
more resistant to collisions. We may even want to use a 64-bit
hash.

My opinion is that we should work on this hashing infrastructure
first, and then support the DDL. If we get the hash functions right,
that frees us up to create better plans, with better push-downs, which
will be good for parallel query.

Regards,
    Jeff Davis

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

03 May 2017, 05:01:07

On Tue, May 2, 2017 at 9:01 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, Feb 28, 2017 at 6:33 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
>> In this patch, user can specify a hash function USING. However,
>> we migth need default hash functions which are useful and
>> proper for hash partitioning.
>
> I suggest that we consider the hash functions more carefully. This is
> (effectively) an on-disk format so it can't be changed easily later.
>
> 1. Consider a partition-wise join of two hash-partitioned tables. If
> that's a hash join, and we just use the hash opclass, we immediately
> lose some useful bits of the hash function. Same for hash aggregation
> where the grouping key is the partition key.

Hmm, that could be a problem in some cases.  I think there's probably
much less of a problem if the modulus isn't a power of two?
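
A small Python sketch may make the power-of-two concern concrete. It
assumes, purely for illustration, that a hash join picks a bucket as
hash % n_buckets using the same hash value that routed the row to its
partition; stand_in_hash and buckets_used are hypothetical names, and
the hash is not PostgreSQL's.

```python
def stand_in_hash(key: int) -> int:
    """Multiplicative stand-in hash; NOT the hash PostgreSQL uses."""
    return (key * 2654435761) % 2**32


def buckets_used(modulus: int, remainder: int, n_buckets: int = 8) -> set:
    """Hash-join buckets touched by the rows of one hash partition.

    Assumes the join picks a bucket as hash % n_buckets, reusing the
    same hash value that routed the row to its partition.
    """
    rows = [k for k in range(1000) if stand_in_hash(k) % modulus == remainder]
    return {stand_in_hash(k) % n_buckets for k in rows}


# Partition 1 of a table with 4 partitions: the low two hash bits are fixed.
power_of_two = buckets_used(modulus=4, remainder=1)
# Partition 1 of a table with 3 partitions: the low bits still vary.
not_power_of_two = buckets_used(modulus=3, remainder=1)

print(sorted(power_of_two), sorted(not_power_of_two))
```

With modulus 4 only two of the eight buckets are ever used, while with
modulus 3 all eight are reachable.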

> To fix this, I think we need to include a salt in the hash API. Each
> level of hashing can choose a random salt.

Do you mean that we'd salt partitioning hashing differently from
grouping hashing, which would be salted differently from aggregation
hashing which, I suppose, would be salted differently from hash index
hashing?  Or do you mean that you'd have to specify a salt when
creating a hash-partitioned table, and make sure it's the same across
all compatibly partitioned tables you might want to hash-join?  That
latter sounds unappealing.

> 2. Consider a partition-wise join where the join keys are varchar(10)
> and char(10). We can't do that join if we just use the existing hash
> strategy, because 'foo' = 'foo       ' should match, but those values
> have different hashes when using the standard hash opclass.
>
> To fix this, we need to be smarter about normalizing values at a
> logical level before hashing. We can take this to varying degrees,
> perhaps even normalizing an integer to a numeric before hashing so
> that you can do a cross-type join on int=numeric.
>
> Furthermore, we need catalog metadata to indicate which hash functions
> are suitable for which cross-type comparisons. Or, to put it the other
> way, which typecasts preserve the partitioning.

You're basically describing what a hash opfamily already does, except
that we don't have a single opfamily that covers both varchar(10) and
char(10), nor do we have one that covers both int and numeric.  We
have one that covers int2, int4, and int8, though.  If somebody wanted
to make the ones you're suggesting, there's nothing preventing it,
although I'm not sure exactly how we'd encourage people to start using
the new one and deprecating the old one.  We don't seem to have a good
infrastructure for that.

> 3. We might want to use a hash function that is a little slower that
> is more resistant to collisions. We may even want to use a 64-bit
> hash.
>
> My opinion is that we should work on this hashing infrastructure
> first, and then support the DDL. If we get the hash functions right,
> that frees us up to create better plans, with better push-downs, which
> will be good for parallel query.

I am opposed to linking the fate of this patch to multiple
independent, possibly large, possibly difficult, possibly
controversial enhancements to the hashing mechanism.  If there are
simple things that can reasonably be done in this patch to make hash
partitioning better, great.  If you want to work on improving the
hashing mechanism as an independent project, also great.  But I think
that most people would rather have hash partitioning in v11 than wait
for v12 or v13 so that other hashing improvements can be completed; I
know I would.  If we say "we shouldn't implement hash partitioning
because some day we might make incompatible changes to the hashing
mechanism" then we'll never implement it, because that will always be
true.  Even the day after we change it, there still may come a future
day when we change it again.

The stakes have already been raised by making hash indexes durable;
that too is arguably making future changes to the hashing
infrastructure harder.  But I still think it was the right thing to
proceed with that work.  If we get 64-bit hash codes in the next
release, and we want hash indexes to use them, then we will have to
invalidate existing hash indexes (again).  That's sad, but not as sad
as it would have been to not commit the work to make hash indexes
durable. There's a chicken-and-egg problem here: without durable hash
indexes and hash partitioning, there's not much incentive to make
hashing better, but once we have them, changes create a backward
compatibility issue.  Such is life; nothing we do is infinitely
future-proof.

The last significant overhaul of the hashing mechanism that I know
about was in 2009, cf. 2604359251d34177a14ef58250d8b4a51d83103b and
8205258fa675115439017b626c4932d5fefe2ea8.  Until this email, I haven't
seen any complaints about the quality of that hash function either in
terms of speed or collision properties - what makes you think those
things are serious problems?  I *have* heard some interest in widening
the output to 64 bits, and also in finding a way to combine multiple
hash values in some smarter way than we do at present.  Seeding has
come up, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

03 May 2017, 16:09:15

On Thu, Apr 27, 2017 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

>I spent some time today looking at these patches.  It seems like there
>is some more work still needed here to produce something committable
>regardless of which way we go, but I am inclined to think that Amul's
>patch is a better basis for work going forward than Nagata-san's
>patch. Here are some general comments on the two patches:

Thanks for your time.

[...]

> - Neither patch contains any documentation updates, which is bad.

Fixed in the attached version.

>
> Nagata-san's patch also contains no regression tests.  Amul's patch
> does, but they need to be rebased, since they no longer apply, and I
> think some other improvements are possible as well.  It's probably not
> necessary to re-test things like whether temp and non-temp tables can
> be mixed within a partitioning hierarchy, but there should be tests
> that tuple routing actually works.  The case where it fails because no
> matching partition exists should be tested as well.  Also, the tests
> should validate not only that FOR VALUES isn't accept when creating a
> hash partition (which they do) but also that WITH (...) isn't accepted
> for a range or list partition (which they do not).
>

Fixed in the attached version.

[...]
> - Amul's patch should perhaps update tab completion support:  create
> table foo1 partition of foo <tab> completes with "for values", but now
> "with" will be another option.
>

Fixed in the attached version.

>
> - Amul's patch probably needs to validate the WITH () clause more
> thoroughly.  I bet you get a not-very-great error message if you leave
> out "modulus" and no error at all if you leave out "remainder".
>

That's not true; there will be a syntax error if you leave out modulus
or remainder, see this:

postgres=# CREATE TABLE hpart_2 PARTITION OF hash_parted  WITH(modulus 4);
ERROR:  syntax error at or near ")"
LINE 1: ...hpart_2 PARTITION OF hash_parted WITH(modulus 4);

>
> This is not yet a detailed review - I may be missing things, and
> review and commentary from others is welcome.  If there is no major
> disagreement with the idea of moving forward using Amul's patch as a
> base, then I will do a more detailed review of that patch (or,
> hopefully, an updated version that addresses the above comments).
>

I have made a small change to the earlier proposed partition-creation
syntax to align it with the current range and list partition syntax.
The new syntax will be as follows:

CREATE TABLE p1 PARTITION OF hash_parted FOR VALUES WITH (modulus 10,
remainder 1);

Regards,
Amul


Attachment

hash-partitioning_another_design-v2.patch

Re: [HACKERS] [POC] hash partitioning

From

Jeff Davis

Date:

04 May 2017, 08:44:54

On Tue, May 2, 2017 at 7:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 2, 2017 at 9:01 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> 1. Consider a partition-wise join of two hash-partitioned tables. If
>> that's a hash join, and we just use the hash opclass, we immediately
>> lose some useful bits of the hash function. Same for hash aggregation
>> where the grouping key is the partition key.
>
> Hmm, that could be a problem in some cases.  I think there's probably
> much less of a problem if the modulus isn't a power of two?

That's true, but it's awkward to describe that to users. And I think
most people would be inclined to use a power-of-two number of
partitions, perhaps coming from other systems.

>> To fix this, I think we need to include a salt in the hash API. Each
>> level of hashing can choose a random salt.
>
> Do you mean that we'd salt partitioning hashing differently from
> grouping hashing which would be salted different from aggregation
> hashing which, I suppose, would be salted differently from hash index
> hashing?

Yes. The way I think about it is that choosing a new random salt is an
easy way to get a new hash function.
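
The idea that a new salt yields a new hash function can be sketched in
Python as follows. This is a hedged illustration only: it uses
hashlib.blake2b (which accepts a salt parameter) as a stand-in, and
salted_hash, PARTITION_SALT, and JOIN_SALT are hypothetical names that
do not come from any patch in this thread.

```python
import hashlib


def salted_hash(key: int, salt: bytes) -> int:
    """32-bit hash of an integer key under a given salt (stand-in, via BLAKE2b)."""
    data = key.to_bytes(8, "little", signed=True)
    digest = hashlib.blake2b(data, digest_size=4, salt=salt).digest()
    return int.from_bytes(digest, "little")


PARTITION_SALT = b"partitioning"  # hypothetical fixed salt used for partition routing
JOIN_SALT = b"hash-join"          # hypothetical salt chosen for the hash join level

# Rows of one partition (remainder 1) in a table with modulus 4.
rows = [k for k in range(2000) if salted_hash(k, PARTITION_SALT) % 4 == 1]

# Reusing the partitioning hash for an 8-bucket join wastes most buckets.
same_salt_buckets = {salted_hash(k, PARTITION_SALT) % 8 for k in rows}
# A different salt behaves like an independent hash function.
new_salt_buckets = {salted_hash(k, JOIN_SALT) % 8 for k in rows}

print(sorted(same_salt_buckets), sorted(new_salt_buckets))
```

In this sketch the salt is internal to each hashing level, matching the
suggestion that it need not be exposed to users.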

> Or do you mean that you'd have to specify a salt when
> creating a hash-partitioned table, and make sure it's the same across
> all compatibly partitioned tables you might want to hash-join?  That
> latter sounds unappealing.

I don't see a reason to expose the salt to users. If we found a reason
in the future, we could, but it would create all of the problems you
are thinking about.

>> 2. Consider a partition-wise join where the join keys are varchar(10)
>> and char(10). We can't do that join if we just use the existing hash
>> strategy, because 'foo' = 'foo       ' should match, but those values
>> have different hashes when using the standard hash opclass.

...

> You're basically describing what a hash opfamily already does, except
> that we don't have a single opfamily that covers both varchar(10) and
> char(10), nor do we have one that covers both int and numeric.  We
> have one that covers int2, int4, and int8, though.  If somebody wanted
> to make the ones you're suggesting, there's nothing preventing it,
> although I'm not sure exactly how we'd encourage people to start using
> the new one and deprecating the old one.  We don't seem to have a good
> infrastructure for that.

OK. I will propose new hash opfamilies for varchar/bpchar/text,
int2/4/8/numeric, and timestamptz/date.

One approach is to promote the narrower type to the wider type, and
then hash. The problem is that would substantially slow down the
hashing of integers, so then we'd need to use one hash opfamily for
partitioning and one for hashjoin, and it gets messy.

The other approach is to check if the wider type is within the domain
of the narrower type, and if so, *demote* the value and then hash. For
instance, '4.2'::numeric would hash the same as it does today, but
'4'::numeric would hash as an int2. I prefer this approach, and int8
already does something resembling it.
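
A minimal sketch of the demotion idea for the integer case, assuming a
stand-in mixer (toy_hash_u32 below is hypothetical and is not PostgreSQL's
hash_uint32). The 64-bit function folds the high half into the low half so
that any value that fits in 32 bits hashes exactly as the 32-bit function
would hash it:

```c
#include <stdint.h>

/* Stand-in 32-bit mixer (assumption): NOT PostgreSQL's hash_uint32(). */
static uint32_t
toy_hash_u32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

/* Hash of the narrower type: the value is hashed as-is. */
static uint32_t
toy_hash_int32(int32_t v)
{
    return toy_hash_u32((uint32_t) v);
}

/*
 * Hash of the wider type.  For a value that fits in int32 the high half is
 * all zeros (non-negative) or all ones (negative), so the XOR below leaves
 * the low half unchanged and the result equals toy_hash_int32().
 */
static uint32_t
toy_hash_int64(int64_t v)
{
    uint32_t    lo = (uint32_t) v;
    uint32_t    hi = (uint32_t) ((uint64_t) v >> 32);

    lo ^= (v >= 0) ? hi : ~hi;
    return toy_hash_u32(lo);
}
```

Values outside the int32 range still get a hash of their own; only the
overlapping domain is forced to agree, which is what cross-type equality
needs.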

For timestamptz/date, it's not nearly as important.

>> My opinion is that we should work on this hashing infrastructure
>> first, and then support the DDL. If we get the hash functions right,
>> that frees us up to create better plans, with better push-downs, which
>> will be good for parallel query.
>
> I am opposed to linking the fate of this patch to multiple
> independent, possibly large, possibly difficult, possibly
> controversial enhancements to the hashing mechanism.

It's a little early in the v11 cycle to be having this argument.
Really what I'm saying is that a small effort now may save us a lot of
headache later.

Regards,    Jeff Davis

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

10 May 2017, 15:34:38

On Wed, May 3, 2017 at 6:39 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Apr 27, 2017 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>
>> This is not yet a detailed review - I may be missing things, and
>> review and commentary from others is welcome.  If there is no major
>> disagreement with the idea of moving forward using Amul's patch as a
>> base, then I will do a more detailed review of that patch (or,
>> hopefully, an updated version that addresses the above comments).
>

I agree that Amul's approach makes dump/restore feasible whereas
Nagata-san's approach makes that difficult. That is a major plus point
of Amul's patch. Also, it leaves room to implement Nagata-san's syntax,
which is more user-friendly, in the future.

Here are some review comments after my initial reading of Amul's patch:

Hash partitioning will partition the data based on the hash value of the
partition key. Does that require collation? Should we throw an error/warning if
collation is specified in PARTITION BY clause?

+    int           *indexes;        /* Partition indexes; in case of hash
+                                 * partitioned table array length will be
+                                 * value of largest modulus, and for others
+                                 * one entry per member of the datums array
+                                 * (plus one if range partitioned table) */
This may be rewritten as "Partition indexes: For hash partitioned table the
number of indexes will be same as the largest modulus. For list partitioned
table the number of indexes will be same as the number of datums. For range
partitioned table the number of indexes will be number of datums plus one.".
You may be able to reword it to a shorter version, but essentially we will have
separate description for each strategy.

I guess we need to change the comments for the other members too. For example
"datums" does not contain tuples with key->partnatts attributes for hash
partitions. It contains a tuple with two attributes, modulus and remainder. We
may not want to track null_index separately since rows with NULL partition key
will fit in the partition corresponding to the hash value of NULL. Or maybe we
want to set null_index to the partition which contains NULL values, if there is
a partition created for the corresponding (modulus, remainder) pair, and set
has_null accordingly. Either way, we will need to update the comments.

cal_hash_value() may be renamed as calc_hash_value() or compute_hash_value()?

Should we change the if .. else if .. construct in RelationBuildPartitionDesc()
to a switch case? There's very little chance that we will support a fourth
partitioning strategy, so if .. else if .. may be fine.

+                        int        mod = hbounds[i]->modulus,
+                                place = hbounds[i]->remainder;
Although there are places in the code where we separate variable declaration
with same type by comma, most of the code declares each variable with the data
type on separate line. Should variable "place" be renamed as "remainder" since
that's what it is ultimately?

RelationBuildPartitionDesc() fills up the mapping array but never uses it. In
this code the index into the mapping array itself is the mapping, so it doesn't
need to be maintained separately as in the list partitioning case. Similarly,
the next_index usage looks unnecessary, although it probably improves
readability, so may be fine.

+ *   for p_p1: satisfies_hash_partition(2, 1, pkey, value)
+ *   for p_p2: satisfies_hash_partition(4, 2, pkey, value)
+ *   for p_p3: satisfies_hash_partition(8, 0, pkey, value)
+ *   for p_p4: satisfies_hash_partition(8, 4, pkey, value)
What the function builds is satisfies_hash_partition(2, 1, pkey). I don't see
code to add value as an argument to the function. Is that correct?

+                        int        modulus = DatumGetInt32(datum);
Maybe you want to rename this variable to greatest_modulus like in the other
places.

+                        Assert(spec->modulus > 0 && spec->remainder >= 0);
I liked this assertion. Do you want to add spec->modulus > spec->remainder also
here?

+    char       *strategy;        /* partitioning strategy
+                                   ('hash', 'list' or 'range') */

We need the second line to start with '*'

+-- check validation when attaching list partitions
Do you want to say "hash" instead of "list" here?

I think we need to explain the reasoning behind this syntax somewhere
as a README or in the documentation or in the comments. Otherwise it's
difficult to understand how various pieces of code are related.

This is not a full review. I am still trying to understand how the hash
partitioning implementation fits with list and range partitioning. I
am going to continue to review this patch further.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

10 May 2017, 19:38:54

On Thu, May 4, 2017 at 1:44 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Hmm, that could be a problem in some cases.  I think there's probably
>> much less of a problem if the modulus isn't a power of two?
>
> That's true, but it's awkward to describe that to users. And I think
> most people would be inclined to use a power-of-two number of
> partitions, perhaps coming from other systems.

Yeah, true.

>>> To fix this, I think we need to include a salt in the hash API. Each
>>> level of hashing can choose a random salt.
>>
>> Do you mean that we'd salt partitioning hashing differently from
>> grouping hashing which would be salted different from aggregation
>> hashing which, I suppose, would be salted differently from hash index
>> hashing?
>
> Yes. The way I think about it is that choosing a new random salt is an
> easy way to get a new hash function.

OK.  One problem, though, is we don't quite have the opclass
infrastructure for this.  A hash opclass's support function is
expected to take one argument, a value of the data type at issue.  The
first idea that occurred to me was to allow an optional second
argument which would be a seed, but that seems like it would require
extensive changes to all of the datatype-specific hash functions and
some of them would probably emerge noticeably slower.  If a function
is just calling hash_uint32 right now then I don't see how we're going
to replace that with something more complex that folds in a salt
without causing performance to drop.  Even just the cost of unpacking
the extra argument might be noticeable.

Another alternative would be to add one additional, optional
hash opclass support function which takes a value of the type in
question as one argument and a seed as a second argument.  That seems
like it might work OK.  Existing code can use the existing support
function 1 with no change, and hash partitioning can use support
function 2.
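
A rough sketch of what the two-function arrangement could look like for
int4, with hypothetical names and stand-in mixers (none of this is
existing PostgreSQL code). The one-argument function keeps its current
shape and cost; only callers that want a salt use the two-argument one:

```c
#include <stdint.h>

/* Stand-in 32-bit mixer (assumption): NOT PostgreSQL's hash_uint32(). */
static uint32_t
toy_mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bu;
    x ^= x >> 13;
    x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

/* "Support function 1": value only, unchanged for existing callers. */
static uint32_t
toy_hashint4(int32_t v)
{
    return toy_mix32((uint32_t) v);
}

/*
 * Hypothetical "support function 2": value plus seed.  The seed is XORed
 * into the input and the result is run through an invertible 64-bit
 * finalizer, so two different seeds always give different outputs for
 * the same value.
 */
static uint64_t
toy_hashint4_seeded(int32_t v, uint64_t seed)
{
    uint64_t    x = (uint64_t) (uint32_t) v ^ seed;

    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}
```

Under this arrangement hash joins and hash indexes would keep calling the
first function, while hash partitioning would call the second with its own
seed.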

>> Or do you mean that you'd have to specify a salt when
>> creating a hash-partitioned table, and make sure it's the same across
>> all compatibly partitioned tables you might want to hash-join?  That
>> latter sounds unappealing.
>
> I don't see a reason to expose the salt to users. If we found a reason
> in the future, we could, but it would create all of the problems you
> are thinking about.

Right, OK.

>> You're basically describing what a hash opfamily already does, except
>> that we don't have a single opfamily that covers both varchar(10) and
>> char(10), nor do we have one that covers both int and numeric.  We
>> have one that covers int2, int4, and int8, though.  If somebody wanted
>> to make the ones you're suggesting, there's nothing preventing it,
>> although I'm not sure exactly how we'd encourage people to start using
>> the new one and deprecating the old one.  We don't seem to have a good
>> infrastructure for that.
>
> OK. I will propose new hash opfamilies for varchar/bpchar/text,
> int2/4/8/numeric, and timestamptz/date.

Cool!  I have no idea how we'll convert from the old ones to the new
ones without breaking things but I agree that it would be nicer if it
were like that rather than the way it is now.

> One approach is to promote the narrower type to the wider type, and
> then hash. The problem is that would substantially slow down the
> hashing of integers, so then we'd need to use one hash opfamily for
> partitioning and one for hashjoin, and it gets messy.

Yes, that sounds messy.

> The other approach is to check if the wider type is within the domain
> of the narrower type, and if so, *demote* the value and then hash. For
> instance, '4.2'::numeric would hash the same as it does today, but
> '4'::numeric would hash as an int2. I prefer this approach, and int8
> already does something resembling it.

Sounds reasonable.

> It's a little early in the v11 cycle to be having this argument.
> Really what I'm saying is that a small effort now may save us a lot of
> headache later.

Well, that's fair enough.  My concern is basically that it may be the
other way around: a large effort to save a small headache later. I
agree that it's probably a good idea to figure out a way to salt the
hash function so that we don't end up with this and partitionwise join
interacting badly, but I don't see the other issues as being very
critical.  I don't have any evidence that there's a big need to
replace our hash functions with new ones, and over on the
partitionwise join thread we gave up on the idea of a cross-type
partitionwise join.  It wouldn't be particularly common (or sensible,
really) even if we ended up supporting it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

10 May 2017, 19:43:20

On Wed, May 10, 2017 at 8:34 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Hash partitioning will partition the data based on the hash value of the
> partition key. Does that require collation? Should we throw an error/warning if
> collation is specified in PARTITION BY clause?

Collation is only relevant for ordering, not equality.  Since hash
opclasses provide only equality, not ordering, it's not relevant here.
I'm not sure whether we should error out if it's specified or just
silently ignore it.  Maybe an ERROR is a good idea?  But not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

10 May 2017, 21:09:04

On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

+[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable
class="parameter">column_name</replaceable> | ( <replaceable
class="parameter">expression</replaceable> ) } [ COLLATE <replaceable

In the department of severe nitpicking, I would have expected this to
either use alphabetical order (HASH | LIST | RANGE) or to add the new
method at the end on the theory that we probably did the important
ones first (RANGE | LIST | HASH).

+  WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>, REMAINDER <replaceable class="PARAMETER">value</replaceable> ) }

Maybe value -> modulus and value -> remainder?
     <para>
+      When creating a hash partition, <literal>MODULUS</literal> should be
+      greater than zero and <literal>REMAINDER</literal> should be greater than
+      or equal to zero.  Every <literal>MODULUS</literal> must be a factor of
+      the next larger modulus.
[ ... and it goes on from there ... ]

This paragraph is fairly terrible, because it's a design spec that I
wrote, not an explanation intended for users.  Here's an attempt to
improve it:

===
When creating a hash partition, a modulus and remainder must be
specified.  The modulus must be a positive integer, and the remainder
must be a non-negative integer less than the modulus.  Typically, when
initially setting up a hash-partitioned table, you should choose a
modulus equal to the number of partitions and assign every table the
same modulus and a different remainder (see examples, below).
However, it is not required that every partition have the same
modulus, only that every modulus which occurs among the children of a
hash-partitioned table is a factor of the next larger modulus.  This
allows the number of partitions to be increased incrementally without
needing to move all the data at once.  For example, suppose you have a
hash-partitioned table with 8 children, each of which has modulus 8,
but find it necessary to increase the number of partitions to 16.  You
can detach one of the modulus-8 partitions, create two new modulus-16
partitions covering the same portion of the key space (one with a
remainder equal to the remainder of the detached partition, and the
other with a remainder equal to that value plus 8), and repopulate
them with data.  You can then repeat this -- perhaps at a later time
-- for each modulus-8 partition until none remain.  While this may
still involve a large amount of data movement at each step, it is
still better than having to create a whole new table and move all the
data at once.
===
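
The arithmetic behind the 8-to-16 example can be checked with a small
sketch. The helper names are hypothetical; the only assumption is that a
row belongs to a partition when hash % modulus == remainder, as described
above. A hash value h satisfies h % 8 == r exactly when h % 16 is either r
or r + 8, so the two new partitions cover the detached one with no overlap:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a row with this hash value belongs to partition (modulus, remainder). */
static bool
hash_in_partition(uint32_t hash, uint32_t modulus, uint32_t remainder)
{
    return hash % modulus == remainder;
}

/*
 * Check, for every hash value in [0, limit), that splitting partition
 * (modulus, remainder) into (2 * modulus, remainder) and
 * (2 * modulus, remainder + modulus) covers the same hash values, with
 * each value landing in exactly one of the two new partitions.
 */
static bool
split_preserves_coverage(uint32_t modulus, uint32_t remainder, uint32_t limit)
{
    for (uint32_t h = 0; h < limit; h++)
    {
        bool        in_old = hash_in_partition(h, modulus, remainder);
        bool        in_low = hash_in_partition(h, 2 * modulus, remainder);
        bool        in_high = hash_in_partition(h, 2 * modulus, remainder + modulus);

        if (in_low && in_high)
            return false;       /* the new partitions must not overlap */
        if (in_old != (in_low || in_high))
            return false;       /* coverage must be identical */
    }
    return true;
}
```

For instance, a row whose hash is 11 lived in the (8, 3) partition and
moves to the (16, 11) partition after the split.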

+CREATE TABLE postal_code (
+    code         int not null,
+    city_id      bigint not null,
+    address      text
+) PARTITION BY HASH (code);

It would be fairly silly to hash-partition the postal_code table,
because there aren't enough postal codes to justify it.  Maybe make
this a lineitem or order table, and partition on the order number.
Also, extend the example to show creating 4 partitions with modulus 4.

+                if (spec->strategy != PARTITION_STRATEGY_HASH)
+                    elog(ERROR, "invalid strategy in partition bound spec");

I think this should be an ereport() if it can happen or an Assert() if
it's supposed to be prevented by the grammar.

+            if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0],
+                               true, sizeof(int)) &&

It doesn't seem necessary to use datumIsEqual() here.  You know the
datums are pass-by-value, so why not just use == ?  I'd include a
comment but I don't think using datumIsEqual() adds anything here
except unnecessary complexity.  More broadly, I wonder why we're
cramming this into the datums arrays instead of just adding another
field to PartitionBoundInfoData that is only used by hash
partitioning.
                   /*
+                     * Check rule that every modulus must be a factor of the
+                     * next larger modulus.  For example, if you have a bunch
+                     * of partitions that all have modulus 5, you can add a new
+                     * new partition with modulus 10 or a new partition with
+                     * modulus 15, but you cannot add both a partition with
+                     * modulus 10 and a partition with modulus 15, because 10
+                     * is not a factor of 15.  However, you could simultaneously
+                     * use modulus 4, modulus 8, modulus 16, and modulus 32 if
+                     * you wished, because each modulus is a factor of the next
+                     * larger one.  You could also use modulus 10, modulus 20,
+                     * and modulus 60. But you could not use modulus 10,
+                     * modulus 15, and modulus 60 for the same reason.
+                     */

I think just the first sentence is fine here; I'd nuke the rest of this.

The block that follows could be merged into the surrounding block.
There's no need to increase the indentation level here, so let's not.
I also suspect that the code itself is wrong.  There are two ways a
modulus can be invalid: it can either fail to be a multiple of the
next lower-modulus, or it can fail to be a factor of the next-higher
modulus.  I think your code only checks the latter.  So for example,
if the current modulus list is (4, 36), your code would correctly
disallow 3 because it's not a factor of 4 and would correctly disallow
23 because it's not a factor of 36, but it looks to me like it would
allow 9 because that's a factor of 36. However, then the list would be
(4, 9, 36), and 4 is not a factor of 9.
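
A sketch of a check that covers both directions, assuming the existing
moduli are available as a sorted ascending array (the function and its
signature are hypothetical, not taken from the patch):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if new_modulus may be added to a partitioned table whose
 * existing moduli are given in ascending order.  The new modulus must be
 * a multiple of the next smaller modulus and a factor of the next larger
 * one; a modulus that is already present is always acceptable.
 */
static bool
modulus_is_valid(const int *moduli, size_t n, int new_modulus)
{
    int         prev = 0;       /* largest existing modulus < new_modulus */
    int         next = 0;       /* smallest existing modulus > new_modulus */

    if (new_modulus <= 0)
        return false;

    for (size_t i = 0; i < n; i++)
    {
        if (moduli[i] == new_modulus)
            return true;        /* already in the list, hence already valid */
        if (moduli[i] < new_modulus)
            prev = moduli[i];
        else
        {
            next = moduli[i];
            break;
        }
    }

    if (prev != 0 && new_modulus % prev != 0)
        return false;           /* not a multiple of the next smaller modulus */
    if (next != 0 && next % new_modulus != 0)
        return false;           /* not a factor of the next larger modulus */
    return true;
}
```

With the list (4, 36) this rejects 3, 23 and 9, and accepts 2, 12 and 72.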

+                    greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]);

Here, insert: /* Normally, the lowest remainder that could conflict
with the new partition is equal to the remainder specified for the new
partition, but when the new partition has a modulus higher than any
used so far, we need to adjust. */

+                    place = spec->remainder;
+                    if (place >= greatest_modulus)
+                        place = place % greatest_modulus;

Here, insert: /* Check every potentially-conflicting remainder. */

+                    do
+                    {
+                        if (boundinfo->indexes[place] != -1)
+                        {
+                            overlap = true;
+                            with = boundinfo->indexes[place];
+                            break;
+                        }
+                        place = place + spec->modulus;

Maybe use += ?

+                    } while (place < greatest_modulus);

+ * Used when sorting hash bounds across all hash modulus
+ * for hash partitioning

This is not a very descriptive comment.  Maybe /* We sort hash bounds
by modulus, then by remainder. */

+cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull)

I agree with Ashutosh's critique of this name.

+    /*
+     * Cache hash function information, similar to how record_eq() caches
+     * equality operator information.  (Perhaps no SQL syntax could cause
+     * PG_NARGS()/nkeys to change between calls through the same FmgrInfo.
+     * Checking nkeys here is just defensiveness.)
+     */

Unless I'm missing something, this comment does not actually describe
what the code does.  Each call to the function repeats the same
TypeCacheEntry lookups.  I'm not actually sure whether caching here
can actually help - is there any situation in which the same FmgrInfo
will get used repeatedly here?  But if it is possible then this code
fails to achieve its intended objective.

Another problem with this code is that, unless I'm missing something,
it completely ignores the opclass the user specified and just looks up
the default hash opclass.  I think you should create a non-default
hash opclass for some data type -- maybe create one for int4 that just
returns the input value unchanged -- and test that specifying the
default hash opclass routes tuples according to hash_uint32(val) %
modulus while specifying your custom opclass routes tuples according
to val % modulus.

Unless I'm severely misunderstanding the situation this code is
seriously undertested.

+             * Identify a btree opclass to use. Currently, we use only btree
+             * operators, which seems enough for list and range partitioning.

This comment is false, right?

+                        appendStringInfoString(buf, "FOR VALUES");
+                        appendStringInfo(buf, " WITH (modulus %d, remainder %d)",
+                                         spec->modulus, spec->remainder);

You could combine these.

+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 0, remainder 1);
+ERROR:  invalid bound specification for a hash partition
+HINT:  modulus must be greater than zero
+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 8, remainder 8);
+ERROR:  invalid bound specification for a hash partition
+HINT:  modulus must be greater than remainder
+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 3, remainder 2);
+ERROR:  invalid bound specification for a hash partition
+HINT:  every modulus must be factor of next largest modulus

It seems like you could merge the hint back into the error:

ERROR: hash partition modulus must be greater than 0
ERROR: hash partition remainder must be less than modulus
ERROR: every hash partition modulus must be a factor of the next larger modulus

+DETAIL:  Partition key of the failing row contains (HASHa, b) = (c, 5).

That's obviously garbled somehow.

+hash_partbound_elem:
+        NonReservedWord Iconst
+            {
+                $$ = makeDefElem($1, (Node *)makeInteger($2), @1);
+            }
+        ;
+
+hash_partbound:
+        hash_partbound_elem ',' hash_partbound_elem
+            {
+                $$ = list_make2($1, $3);
+            }
+        ;

I don't think that it's the grammar's job to enforce that exactly two
options are present.  It should allow any number of options, and some
later code, probably during parse analysis, should check that the ones
you need are present and that there are no invalid ones.  See the code
for EXPLAIN, VACUUM, etc.

Regarding the test cases, I think that you've got a lot of tests for
failure scenarios (which is good) but not enough for success
scenarios.  For example, you test that inserting a row into the wrong
hash partition fails, but not (unless I missed it) that tuple routing
succeeds.  I think it would be good to have a test where you insert
1000 or so rows into a hash partitioned table just to see it all work.

Also, you haven't done anything about the fact that constraint
exclusion doesn't work for hash partitioned tables, a point I raised
in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
and which I still think is quite important.  I think that to have a
committable patch for this feature that would have to be addressed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Dilip Kumar

Date:

11 May 2017, 19:02:11

On Wed, May 3, 2017 at 6:39 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Apr 27, 2017 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>I spent some time today looking at these patches.  It seems like there
>>is some more work still needed here to produce something committable
>>regardless of which way we go, but I am inclined to think that Amul's
>>patch is a better basis for work going forward than Nagata-san's
>>patch. Here are some general comments on the two patches:
>
> Thanks for your time.
>
> [...]
>
>> - Neither patch contains any documentation updates, which is bad.
>
> Fixed in the attached version.

I have done an initial review of the patch and I have some comments.  I
will continue the review and testing and report the results soon.

-----
Patch needs to be rebased

----

    if (key->strategy == PARTITION_STRATEGY_RANGE)
    {
        /* Disallow nulls in the range partition key of the tuple */
        for (i = 0; i < key->partnatts; i++)
            if (isnull[i])
                ereport(ERROR,
                        (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                         errmsg("range partition key of row contains null")));
    }

We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
for hash also, right?
----

    RangeDatumContent **content;    /* what's contained in each range bound datum?
                                     * (see the above enum); NULL for list
                                     * partitioned tables */

This will be NULL for hash as well, so we need to change the comment.
-----
    bool        has_null;       /* Is there a null-accepting partition? false
                                 * for range partitioned tables */
    int         null_index;     /* Index of the null-accepting partition; -1

Comments need to be changed for these two members as well
----

+/* One bound of a hash partition */
+typedef struct PartitionHashBound
+{
+ int modulus;
+ int remainder;
+ int index;
+} PartitionHashBound;

It would be good to add some comments to explain the structure members


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

12 May 2017, 04:42:32

On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
> for hash also, right?

I think it should.

Actually, I think that not supporting nulls for range partitioning may
have been a fairly bad decision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

12 May 2017, 05:15:51

On 2017/05/12 10:42, Robert Haas wrote:
> On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
>> for hash also, right?
> 
> I think it should.
> 
> Actually, I think that not supporting nulls for range partitioning may
> have been a fairly bad decision.

I think the relevant discussion concluded [1] that way, because we
couldn't decide which interface to provide for specifying where NULLs are
placed or because we decided to think about it later.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZN_Zf7MBb48O66FAJgFe0S9_NkLVeQNBz6hsxb6Og93w%40mail.gmail.com

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

12 May 2017, 05:20:48

On Thu, May 11, 2017 at 10:15 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/05/12 10:42, Robert Haas wrote:
>> On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
>>> for hash also, right?
>>
>> I think it should.
>>
>> Actually, I think that not supporting nulls for range partitioning may
>> have been a fairly bad decision.
>
> I think the relevant discussion concluded [1] that way, because we
> couldn't decide which interface to provide for specifying where NULLs are
> placed or because we decided to think about it later.

Yeah, but I have a feeling that marking the columns NOT NULL is going
to make it really hard to support that in the future when we get the
syntax hammered out.  If it had only affected the partition
constraints that'd be different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

12 May 2017, 05:38:11

On 2017/05/12 11:20, Robert Haas wrote:
> On Thu, May 11, 2017 at 10:15 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/05/12 10:42, Robert Haas wrote:
>>> On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
>>>> for hash also, right?
>>>
>>> I think it should.
>>>
>>> Actually, I think that not supporting nulls for range partitioning may
>>> have been a fairly bad decision.
>>
>> I think the relevant discussion concluded [1] that way, because we
>> couldn't decide which interface to provide for specifying where NULLs are
>> placed or because we decided to think about it later.
> 
> Yeah, but I have a feeling that marking the columns NOT NULL is going
> to make it really hard to support that in the future when we get the
> syntax hammered out.  If it had only affected the partition
> constraints that'd be different.

So, adding keycol IS NOT NULL (like we currently do for expressions) in
the implicit partition constraint would be more future-proof than
generating an actual catalogued NOT NULL constraint on the keycol?  I now
tend to think it would be better.  Directly inserting into a range
partition with a NULL value for a column currently generates a "null value
in column \"%s\" violates not-null constraint" instead of perhaps more
relevant "new row for relation \"%s\" violates partition constraint".
That said, we *do* document the fact that a NOT NULL constraint is added
on range key columns, but we might as well document instead that we don't
currently support routing tuples with NULL values in the partition key
through a range-partitioned table and so NULL values cause error.

Can we still decide to do that instead?

Thanks,
Amit

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

12 May 2017, 08:14:46

On Fri, May 12, 2017 at 7:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
>> for hash also, right?
>
> I think it should.
>
+1

As long as we can hash a NULL value, we should place a value with NULL
key in the corresponding partition, most probably the one with
remainder 0.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

12 May 2017, 08:24:46

On Fri, May 12, 2017 at 8:08 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/05/12 11:20, Robert Haas wrote:
>> On Thu, May 11, 2017 at 10:15 PM, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> On 2017/05/12 10:42, Robert Haas wrote:
>>>> On Thu, May 11, 2017 at 12:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>>> We need to add PARTITION_STRATEGY_HASH as well, we don't support NULL
>>>>> for hash also, right?
>>>>
>>>> I think it should.
>>>>
>>>> Actually, I think that not supporting nulls for range partitioning may
>>>> have been a fairly bad decision.
>>>
>>> I think the relevant discussion concluded [1] that way, because we
>>> couldn't decide which interface to provide for specifying where NULLs are
>>> placed or because we decided to think about it later.
>>
>> Yeah, but I have a feeling that marking the columns NOT NULL is going
>> to make it really hard to support that in the future when we get the
>> syntax hammered out.  If it had only affected the partition
>> constraints that'd be different.
>
> So, adding keycol IS NOT NULL (like we currently do for expressions) in
> the implicit partition constraint would be more future-proof than
> generating an actual catalogued NOT NULL constraint on the keycol?  I now
> tend to think it would be better.  Directly inserting into a range
> partition with a NULL value for a column currently generates a "null value
> in column \"%s\" violates not-null constraint" instead of perhaps more
> relevant "new row for relation \"%s\" violates partition constraint".
> That said, we *do* document the fact that a NOT NULL constraint is added
> on range key columns, but we might as well document instead that we don't
> currently support routing tuples with NULL values in the partition key
> through a range-partitioned table and so NULL values cause error.

In get_partition_for_tuple() we have

    if (key->strategy == PARTITION_STRATEGY_RANGE)
    {
        /* Disallow nulls in the range partition key of the tuple */
        for (i = 0; i < key->partnatts; i++)
            if (isnull[i])
                ereport(ERROR,
                        (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                         errmsg("range partition key of row contains null")));
    }

Instead of throwing an error here, we should probably return -1 and
let the error be "no partition of relation \"%s\" found for row",
which is the real error: not having a partition which can accept NULL.
If in the future we decide to support NULL values in partition keys, we
just need to remove the above code from get_partition_for_tuple() and
everything will work as is. I am assuming that we don't add any
implicit/explicit NOT NULL constraint right now.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

12 May 2017, 11:16:24

On 2017/05/12 14:24, Ashutosh Bapat wrote:
> On Fri, May 12, 2017 at 8:08 AM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2017/05/12 11:20, Robert Haas wrote:
>>> Yeah, but I have a feeling that marking the columns NOT NULL is going
>>> to make it really hard to support that in the future when we get the
>>> syntax hammered out.  If it had only affected the partition
>>> constraints that'd be different.
>>
>> So, adding keycol IS NOT NULL (like we currently do for expressions) in
>> the implicit partition constraint would be more future-proof than
>> generating an actual catalogued NOT NULL constraint on the keycol?  I now
>> tend to think it would be better.  Directly inserting into a range
>> partition with a NULL value for a column currently generates a "null value
>> in column \"%s\" violates not-null constraint" instead of perhaps more
>> relevant "new row for relation \"%s\" violates partition constraint".
>> That said, we *do* document the fact that a NOT NULL constraint is added
>> on range key columns, but we might as well document instead that we don't
>> currently support routing tuples with NULL values in the partition key
>> through a range-partitioned table and so NULL values cause error.
> 
> in get_partition_for_tuple() we have
>         if (key->strategy == PARTITION_STRATEGY_RANGE)
>         {
>             /* Disallow nulls in the range partition key of the tuple */
>             for (i = 0; i < key->partnatts; i++)
>                 if (isnull[i])
>                     ereport(ERROR,
>                             (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
>                         errmsg("range partition key of row contains null")));
>         }
> 
> Instead of throwing an error here, we should probably return -1 and
> let the error be "no partition of relation \"%s\" found for row",
> which is the real error, not having a partition which can accept NULL.
> If in future we decide to support NULL values in partition keys, we
> need to just remove above code from get_partition_for_tuple() and
> everything will work as is. I am assuming that we don't add any
> implicit/explicit NOT NULL constraint right now.

We *do* actually, for real columns:

create table p (a int) partition by range (a);
\d p
                 Table "public.p"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           | not null |
Partition key: RANGE (a)

For expression keys, we emit IS NOT NULL as part of the implicit partition
constraint.  The above check for NULL is really for the expressions,
because if any simple columns of the key contain NULL, they will fail the
NOT NULL constraint itself (with that error message).  As I said in my
previous message, I'm thinking that emitting IS NOT NULL as part of the
implicit partition constraint might be better instead of adding it as a
NOT NULL constraint, that is, for the simple column keys; we already do
that for the expression keys for which we cannot add the NOT NULL
constraint anyway.

The way things are currently, the error messages generated when a row with
NULL in the range partition key is inserted *directly* into the partition
look a bit inconsistent, depending on whether the target key is a simple
column or an expression:

create table p (a int, b int) partition by range (a, abs(b));
create table p1 partition of p for values from (1, 1) to (1, 10);

insert into p1 values (NULL, NULL);
ERROR:  null value in column "a" violates not-null constraint
DETAIL:  Failing row contains (null, null).

insert into p1 values (1, NULL);
ERROR:  new row for relation "p1" violates partition constraint
DETAIL:  Failing row contains (1, null).

It would be nice if both said "violates partition constraint".

BTW, note that this is independent of your suggestion to emit "partition
not found" message instead of the "no NULLs allowed in the range partition
key" message, which seems fine to me to implement.

Thanks,
Amit

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

12 May 2017, 13:34:23

On Wed, May 10, 2017 at 6:04 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Wed, May 3, 2017 at 6:39 PM, amul sul <sulamul@gmail.com> wrote:
>> On Thu, Apr 27, 2017 at 1:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>>
>>> This is not yet a detailed review - I may be missing things, and
>>> review and commentary from others is welcome.  If there is no major
>>> disagreement with the idea of moving forward using Amul's patch as a
>>> base, then I will do a more detailed review of that patch (or,
>>> hopefully, an updated version that addresses the above comments).
>>
>
> I agree that Amul's approach makes dump/restore feasible whereas
> Nagata-san's approach makes that difficult. That is a major plus point
> about Amul's patch. Also, it makes it possible to implement
> Nagata-san's syntax, which is more user-friendly in future.
>
> Here are some review comments after my initial reading of Amul's patch:
>
> Hash partitioning will partition the data based on the hash value of the
> partition key. Does that require collation? Should we throw an error/warning if
> collation is specified in PARTITION BY clause?
>
> +    int           *indexes;        /* Partition indexes; in case of hash
> +                                 * partitioned table array length will be
> +                                 * value of largest modulus, and for others
> +                                 * one entry per member of the datums array
> +                                 * (plus one if range partitioned table) */
> This may be rewritten as "Partition indexes: For hash partitioned table the
> number of indexes will be same as the largest modulus. For list partitioned
> table the number of indexes will be same as the number of datums. For range
> partitioned table the number of indexes will be number of datums plus one.".
> You may be able to reword it to a shorter version, but essentially we will have
> separate description for each strategy.
>
Okay, will fix this.

> I guess, we need to change the comments for the other members too. For example
> "datums" does not contain tuples with key->partnatts attributes for hash
> partitions. It contains a tuple with two attributes, modulus and remainder. We
> may not want to track null_index separately since rows with NULL partition key
> will fit in the partition corresponding to the hash value of NULL. OR may be we
> want to set null_index to partition which contains NULL values, if there is a
> partition created for corresponding remainder, modulus pair and set has_null
> accordingly. Accordingly we will need to update the comments.
>
> cal_hash_value() may be renamed as calc_hash_value() or compute_hash_value()?
>
Okay, will rename to compute_hash_value().

> Should we change the if .. else if .. construct in RelationBuildPartitionDesc()
> to a switch case? There's very little chance that we will support a fourth
> partitioning strategy, so if .. else if .. may be fine.
>
> +                        int        mod = hbounds[i]->modulus,
> +                                place = hbounds[i]->remainder;
> Although there are places in the code where we separate variable declaration
> with same type by comma, most of the code declares each variable with the data
> type on separate line. Should variable "place" be renamed as "remainder" since
> that's what it is ultimately?
>
Okay, will rename "place" to "remainder".

> RelationBuildPartitionDesc() fills up mapping array but never uses it. In this

Agreed, the mapping array is not very useful here, but it is not useless
either: it is required at the end of RelationBuildPartitionDesc() while
assigning OIDs to result->oids; see the for-loop just before the mapping
memory is released.

> code the index into mapping array itself is the mapping so it doesn't need to
> be maintained separately like list partitioning case. Similarly next_index usage
> looks unnecessary, although that probably improves readability, so may be fine.
>
Anyway, will remove uses of "next_index".

> + *   for p_p1: satisfies_hash_partition(2, 1, pkey, value)
> + *   for p_p2: satisfies_hash_partition(4, 2, pkey, value)
> + *   for p_p3: satisfies_hash_partition(8, 0, pkey, value)
> + *   for p_p4: satisfies_hash_partition(8, 4, pkey, value)
> What the function builds is satisfies_hash_partition(2, 1, pkey). I don't see
> code to add value as an argument to the function. Is that correct?
>
Sorry for the confusion; "pkey" & "value" are the columns of the table in the
given example.
I have renamed those columns to "a" & "b".

> +                        int        modulus = DatumGetInt32(datum);
> May be you want to rename this variable to greatest_modulus like in the other
> places.
>
Okay, will fix this.

> +                        Assert(spec->modulus > 0 && spec->remainder >= 0);
> I liked this assertion. Do you want to add spec->modulus > spec->remainder also
> here?
>
Okay, will add this too.

> +    char       *strategy;        /* partitioning strategy
> +                                   ('hash', 'list' or 'range') */
>
> We need the second line to start with '*'
>
> +-- check validation when attaching list partitions
> Do you want to say "hash" instead of "list" here?
>
You are correct, will fix this too.

> I think we need to explain the reasoning behind this syntax somewhere
> as a README or in the documentation or in the comments. Otherwise it's
> difficult to understand how various pieces of code are related.
>
Not sure about a README. I think we should focus on documentation & code
comments first, and then think about a developer-oriented README if the
hash partitioning logic turns out to be too difficult to understand.

> This is not full review. I am still trying to understand how the hash
> partitioning implementation fits with list and range partitioning. I
> am going to continue to review this patch further.
>
Thanks a lot for your help.

Regards,
Amul

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

12 May 2017, 15:38:36

Hi,

Please find the following updated patches attached:

0001-Cleanup.patch : Does some cleanup and code refactoring required
for hash partition patch. Otherwise, there will be unnecessary diff in
0002 patch

0002-hash-partitioning_another_design-v3.patch: Addressed review
comments given by Ashutosh and Robert.

On Wed, May 10, 2017 at 11:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 3, 2017 at 9:09 AM, amul sul <sulamul@gmail.com> wrote:
>> Fixed in the attached version.
>
> +[ PARTITION BY { HASH | RANGE | LIST } ( { <replaceable
> class="parameter">column_name</replaceable> | ( <replaceable
> class="parameter">expression</replaceable> ) } [ COLLATE <replaceable
>
> In the department of severe nitpicking, I would have expected this to
> either use alphabetical order (HASH | LIST | RANGE) or to add the new
> method at the end on the theory that we probably did the important
> ones first (RANGE | LIST | HASH).
>
Fixed in the attached version.

> +  WITH ( MODULUS <replaceable class="PARAMETER">value</replaceable>,
> REMAINDER <replaceable class="PARAMETER">value</replaceable> ) }
>
> Maybe value -> modulus and value -> remainder?
>
Fixed in the attached version.

>       <para>
> +      When creating a hash partition, <literal>MODULUS</literal> should be
> +      greater than zero and <literal>REMAINDER</literal> should be greater than
> +      or equal to zero.  Every <literal>MODULUS</literal> must be a factor of
> +      the next larger modulus.
> [ ... and it goes on from there ... ]
>
> This paragraph is fairly terrible, because it's a design spec that I
> wrote, not an explanation intended for users.  Here's an attempt to
> improve it:
>
> ===
> When creating a hash partition, a modulus and remainder must be
> specified.  The modulus must be a positive integer, and the remainder
> must be a non-negative integer less than the modulus.  Typically, when
> initially setting up a hash-partitioned table, you should choose a
> modulus equal to the number of partitions and assign every table the
> same modulus and a different remainder (see examples, below).
> However, it is not required that every partition have the same
> modulus, only that every modulus which occurs among the children of a
> hash-partitioned table is a factor of the next larger modulus.  This
> allows the number of partitions to be increased incrementally without
> needing to move all the data at once.  For example, suppose you have a
> hash-partitioned table with 8 children, each of which has modulus 8,
> but find it necessary to increase the number of partitions to 16.  You
> can detach one of the modulus-8 partitions, create two new modulus-16
> partitions covering the same portion of the key space (one with a
> remainder equal to the remainder of the detached partition, and the
> other with a remainder equal to that value plus 8), and repopulate
> them with data.  You can then repeat this -- perhaps at a later time
> -- for each modulus-8 partition until none remain.  While this may
> still involve a large amount of data movement at each step, it is
> still better than having to create a whole new table and move all the
> data at once.
> ===
>
Thanks a lot, added in attached version.
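
The incremental-split argument in the quoted paragraph rests on simple modular arithmetic: a hash value h with h % 8 == r always satisfies exactly one of h % 16 == r or h % 16 == r + 8. A minimal C sketch of that property follows. The function names are hypothetical and are not taken from the patch, and the raw uint32_t stands in for whatever hash value the partition key produces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): does a row whose key hashes to
 * "hash" belong to the partition declared WITH (modulus, remainder)?
 */
bool
hash_row_matches(uint32_t hash, int modulus, int remainder)
{
    /* Same bound rules as in the thread: 0 <= remainder < modulus. */
    assert(modulus > 0 && remainder >= 0 && remainder < modulus);
    return hash % (uint32_t) modulus == (uint32_t) remainder;
}

/*
 * Replace partition (old_modulus, old_remainder) by the two partitions
 * (2 * old_modulus, old_remainder) and
 * (2 * old_modulus, old_remainder + old_modulus),
 * and count how many of the two replacements accept "hash".
 * The result is 1 for every hash the old partition accepted, else 0.
 */
int
count_split_matches(uint32_t hash, int old_modulus, int old_remainder)
{
    int new_modulus = old_modulus * 2;

    return (int) hash_row_matches(hash, new_modulus, old_remainder)
        + (int) hash_row_matches(hash, new_modulus,
                                 old_remainder + old_modulus);
}
```

For example, with an old partition (modulus 8, remainder 3), the hash values 3, 11 and 19 each land in exactly one of the two modulus-16 replacements, while a value such as 4, which the old partition never accepted, lands in neither.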

> +CREATE TABLE postal_code (
> +    code         int not null,
> +    city_id      bigint not null,
> +    address      text
> +) PARTITION BY HASH (code);
>
> It would be fairly silly to hash-partition the postal_code table,
> because there aren't enough postal codes to justify it.  Maybe make
> this a lineitem or order table, and partition on the order number.
> Also, extend the example to show creating 4 partitions with modulus 4.
>
Understood, added order table example.

> +                if (spec->strategy != PARTITION_STRATEGY_HASH)
> +                    elog(ERROR, "invalid strategy in partition bound spec");
>
> I think this should be an ereport() if it can happen or an Assert() if
> it's supposed to be prevented by the grammar.
>
Used Assert() in the attached version, and made the same change for RANGE
and LIST in the 0001 cleanup patch.

> +            if (!(datumIsEqual(b1->datums[i][0], b2->datums[i][0],
> +                               true, sizeof(int)) &&
>
> It doesn't seem necessary to use datumIsEqual() here.  You know the
> datums are pass-by-value, so why not just use == ?  I'd include a
> comment but I don't think using datumIsEqual() adds anything here
> except unnecessary complexity.  More broadly, I wonder why we're
> cramming this into the datums arrays instead of just adding another
> field to PartitionBoundInfoData that is only used by hash
> partitioning.
>
Fixed in the attached version.

>                     /*
> +                     * Check rule that every modulus must be a factor of the
> +                     * next larger modulus.  For example, if you have a bunch
> +                     * of partitions that all have modulus 5, you can add a new
> +                     * new partition with modulus 10 or a new partition with
> +                     * modulus 15, but you cannot add both a partition with
> +                     * modulus 10 and a partition with modulus 15, because 10
> +                     * is not a factor of 15.  However, you could
> simultaneously
> +                     * use modulus 4, modulus 8, modulus 16, and modulus 32 if
> +                     * you wished, because each modulus is a factor of the next
> +                     * larger one.  You could also use modulus 10, modulus 20,
> +                     * and modulus 60. But you could not use modulus 10,
> +                     * modulus 15, and modulus 60 for the same reason.
> +                     */
>
> I think just the first sentence is fine here; I'd nuke the rest of this.
>
Fixed in the attached version.

> The block that follows could be merged into the surrounding block.
> There's no need to increase the indentation level here, so let's not.
> I also suspect that the code itself is wrong.  There are two ways a
> modulus can be invalid: it can either fail to be a multiple of the
> next lower-modulus, or it can fail to be a factor of the next-higher
> modulus.  I think your code only checks the latter.  So for example,
> if the current modulus list is (4, 36), your code would correctly
> disallow 3 because it's not a factor of 4 and would correctly disallow
> 23 because it's not a factor of 36, but it looks to me like it would
> allow 9 because that's a factor of 36. However, then the list would be
> (4, 9, 36), and 4 is not a factor of 9.
>
This case was already handled in the previous patch, and a similar regression
test exists in create_table.sql; see this in the v2 patch.

  +-- check partition bound syntax for the hash partition
  +CREATE TABLE hash_parted (
  +   a int
  +) PARTITION BY HASH (a);
  +CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (modulus 10, remainder 1);
  +CREATE TABLE hpart_2 PARTITION OF hash_parted FOR VALUES WITH (modulus 50, remainder 0);
  +-- modulus 25 is factor of modulus of 50 but 10 is not factor of 25.
  +CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (modulus 25, remainder 2);

> +                    greatest_modulus = DatumGetInt32(datums[ndatums - 1][0]);
>
> Here, insert: /* Normally, the lowest remainder that could conflict
> with the new partition is equal to the remainder specified for the new
> partition, but when the new partition has a modulus higher than any
> used so far, we need to adjust. */
>
> +                    place = spec->remainder;
> +                    if (place >= greatest_modulus)
> +                        place = place % greatest_modulus;
>
Fixed in the attached version.

> Here, insert: /* Check every potentially-conflicting remainder. */
>
> +                    do
> +                    {
> +                        if (boundinfo->indexes[place] != -1)
> +                        {
> +                            overlap = true;
> +                            with = boundinfo->indexes[place];
> +                            break;
> +                        }
> +                        place = place + spec->modulus;
>
> Maybe use += ?
>
Fixed.

> +                    } while (place < greatest_modulus);
>
> + * Used when sorting hash bounds across all hash modulus
> + * for hash partitioning
>
> This is not a very descriptive comment.  Maybe /* We sort hash bounds
> by modulus, then by remainder. */
>
Fixed.
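
The overlap check quoted above (normalize the starting slot, then probe every modulus-th slot up to the greatest modulus) can be sketched as a stand-alone C function. This is only an illustration under an assumption suggested by the quoted loop: indexes[] has one slot per remainder of the greatest existing modulus, holding the index of the partition that owns that remainder, or -1 if the slot is free. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical stand-alone version of the quoted loop.  Returns the index
 * of an existing partition that overlaps the new bound
 * (modulus, remainder), or -1 if no existing partition overlaps.
 *
 * Assumption: indexes[] has greatest_modulus entries; entry k is the index
 * of the partition accepting hash values with hash % greatest_modulus == k,
 * or -1 if no partition accepts them.
 */
int
find_hash_overlap(const int *indexes, int greatest_modulus,
                  int modulus, int remainder)
{
    int place = remainder;

    assert(greatest_modulus > 0);
    assert(modulus > 0 && remainder >= 0 && remainder < modulus);

    /*
     * Normally the lowest remainder that could conflict is the new
     * partition's own remainder, but a new modulus larger than any used so
     * far can put it outside the array, so wrap it around.
     */
    if (place >= greatest_modulus)
        place %= greatest_modulus;

    /* Check every potentially-conflicting remainder. */
    do
    {
        if (indexes[place] != -1)
            return indexes[place];
        place += modulus;
    } while (place < greatest_modulus);

    return -1;
}
```

With existing partitions (modulus 4, remainder 0) as index 0 and (modulus 8, remainder 2) as index 1, the array is {0, -1, 1, -1, 0, -1, -1, -1}. A new bound (modulus 4, remainder 2) collides with partition 1, and (modulus 16, remainder 12) collides with partition 0, while (modulus 8, remainder 6) and (modulus 16, remainder 9) are free.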

> +cal_hash_value(FmgrInfo *partsupfunc, int nkeys, Datum *values, bool *isnull)
>
> I agree with Ashutosh's critique of this name.
>
Fixed.

> +    /*
> +     * Cache hash function information, similar to how record_eq() caches
> +     * equality operator information.  (Perhaps no SQL syntax could cause
> +     * PG_NARGS()/nkeys to change between calls through the same FmgrInfo.
> +     * Checking nkeys here is just defensiveness.)
> +     */
>
> Unless I'm missing something, this comment does not actually describe
> what the code does.  Each call to the function repeats the same
> TypeCacheEntry lookups.  I'm not actually sure whether caching here
> can actually help - is there any situation in which the same FmgrInfo
> will get used repeatedly here?  But if it is possible then this code
> fails to achieve its intended objective.
>
This code no longer exists in the new satisfies_hash_partition() code.

> Another problem with this code is that, unless I'm missing something,
> it completely ignores the opclass the user specified and just looks up
> the default hash opclass.  I think you should create a non-default
> hash opclass for some data type -- maybe create one for int4 that just
> returns the input value unchanged -- and test that specifying the
> default hash opclass routes tuples according to hash_uint32(val) %
> modulus while specifying your custom opclass routes tuples according
> to val % modulus.
>
> Unless I'm severely misunderstanding the situation this code is
> seriously undertested.
>
You are correct, I missed the opclass handling.  Fixed in the
attached version, and added a regression test for this case.

> +             * Identify a btree opclass to use. Currently, we use only btree
> +             * operators, which seems enough for list and range partitioning.
>
> This comment is false, right?
>
Not really, this has been re-added due to indentation change.

> +                        appendStringInfoString(buf, "FOR VALUES");
> +                        appendStringInfo(buf, " WITH (modulus %d,
> remainder %d)",
> +                                         spec->modulus, spec->remainder);
>
> You could combine these.
>
I am not sure about this; I've used the same code style that exists in
get_rule_expr() for range and list.  Do you want me to change this for
the other partitioning strategies as well?

> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 0, remainder 1);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  modulus must be greater than zero
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 8, remainder 8);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  modulus must be greater than remainder
> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 3, remainder 2);
> +ERROR:  invalid bound specification for a hash partition
> +HINT:  every modulus must be factor of next largest modulus
>
> It seems like you could merge the hint back into the error:
>
> ERROR: hash partition modulus must be greater than 0
> ERROR: hash partition remainder must be less than modulus
> ERROR: every hash partition modulus must be a factor of the next larger modulus
>
Added same in the attached version. Thanks again.

> +DETAIL:  Partition key of the failing row contains (HASHa, b) = (c, 5).
>
> That's obviously garbled somehow.
>
Oops.  Fixed in the attached version.

> +hash_partbound_elem:
> +        NonReservedWord Iconst
> +            {
> +                $$ = makeDefElem($1, (Node *)makeInteger($2), @1);
> +            }
> +        ;
> +
> +hash_partbound:
> +        hash_partbound_elem ',' hash_partbound_elem
> +            {
> +                $$ = list_make2($1, $3);
> +            }
> +        ;
>
> I don't think that it's the grammar's job to enforce that exactly two
> options are present.  It should allow any number of options, and some
> later code, probably during parse analysis, should check that the ones
> you need are present and that there are no invalid ones.  See the code
> for EXPLAIN, VACUUM, etc.
>
Tried to fix this in the attached version.

> Regarding the test cases, I think that you've got a lot of tests for
> failure scenarios (which is good) but not enough for success
> scenarios.  For example, you test that inserting a row into the wrong
> hash partition fails, but not (unless I missed it) that tuple routing
> succeeds.  I think it would be good to have a test where you insert
> 1000 or so rows into a hash partitioned table just to see it all work.
>
I am quite unsure about this test: how can we verify correct
tuple routing?

> Also, you haven't done anything about the fact that constraint
> exclusion doesn't work for hash partitioned tables, a point I raised
> in http://postgr.es/m/CA+Tgmob7RsN5A=ehgYbLPx--c5CmptrK-dB=Y-v--o+TKyfteA@mail.gmail.com
> and which I still think is quite important.  I think that to have a
> committable patch for this feature that would have to be addressed.
>
Do you mean we should come up with special handling (pre-pruning) for
hash partitioning, or modify constraint exclusion so that it will
handle the hash partition expression and the cases that you have discussed
in thread [1] as well?  I was under the impression that we were going to
have this as a separate feature proposal.

[1] https://www.postgresql.org/message-id/CA%2BTgmoaE9NZ_RiqZQLp2aJXPO4E78QxkQYL-FR2zCDop96Ahdg%40mail.gmail.com

Regards,
Amul Sul

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
> Hi,
>
> Please find the following updated patches attached:
>
> 0001-Cleanup.patch : Does some cleanup and code refactoring required
> for hash partition patch. Otherwise, there will be unnecessary diff in
> 0002 patch

Thanks for splitting the patch.

+                if (isnull[0])
+                    cur_index = partdesc->boundinfo->null_index;
This code assumes that null_index will be set to -1 when has_null is false. Per
RelationBuildPartitionDesc() this is true. But probably we should write this
code as
if (isnull[0])
{
    if (partdesc->boundinfo->has_null)
        cur_index = partdesc->boundinfo->null_index;
}
That way we are certain that when has_null is false, cur_index = -1 similar to
the original code.

The additional argument to ComputePartitionAttrs() isn't used anywhere in this
patch, so maybe this should be part of 0002. If we do this the only change
that will remain in patch is the refactoring of RelationBuildPartitionDesc(),
so we may consider merging it into 0002, unless we find that some more
refactoring is needed. But for now, having it as a separate patch helps.

Here are some more comments on 0002:

+ * In the case of hash partitioning, datums is a 2-D array, stores modulus and
+ * remainder values at datums[x][0] and datums[x][1] respectively for each
+ * partition in the ascending order.

This comment about datums should appear in a paragraph of its own and may be
rephrased as in the attached patch. Maybe we could also add a line about
ndatums for hash partitioned tables as in the attached patch.


+                                 * (see the above enum); NULL for has and list
typo s/has/hash

+        if (key->strategy == PARTITION_STRATEGY_HASH)
+        {
+            ndatums = nparts;
+            hbounds = (PartitionHashBound **) palloc(nparts *
+                                    sizeof(PartitionHashBound *));
+            i = 0;
+            foreach (cell, boundspecs)
+            {
+                PartitionBoundSpec *spec = lfirst(cell);
+
[ clipped ]
+                hbounds[i]->index = i;
+                i++;
+            }
For list and range partitioned tables we order the bounds so that two
partitioned tables have them in the same order irrespective of the order in
which they are specified by the user or hence stored in the catalogs. The
partitions then get indexes according to the order in which their bounds
appear in the ordered arrays of bounds. Thus any two partitioned tables with
the same partition specification always have the same PartitionBoundInfoData.
This helps partition-wise join to match the partition bounds of two given
tables.  The above code assigns the indexes to the partitions as they appear
in the catalogs. This means that two partitioned tables with the same
partition specification but a different order of partition bound
specifications will have different PartitionBoundInfoData representations.

If we assign the indexes from the ordered bounds for hash partitions as well,
probably partition_bounds_equal() would reduce to just matching indexes and
the last element of the datums array, i.e. the greatest modulus datum.
If the ordered datums arrays of two partitioned tables do not match exactly,
the mismatch can be because of missing datums or different datums. A missing
datum will change the greatest modulus or leave the corresponding entry in
the indexes array as -1. A differing entry will cause mismatching indexes in
the indexes arrays.

+                     * is not a factor of 15.
+                     *
+                     *
+                     * Get greatest bound in array boundinfo->datums which is
An extra line here.


+                    if (offset < 0)
+                    {
+                        nmod = DatumGetInt32(datums[0][0]);
+                        valid_bound = (nmod % spec->modulus) == 0;
+                    }
+                    else
+                    {
+                        pmod = DatumGetInt32(datums[offset][0]);
+                        valid_bound = (spec->modulus % pmod) == 0;
+
+                        if (valid_bound && (offset + 1) < ndatums)
+                        {
+                            nmod = DatumGetInt32(datums[offset + 1][0]);
+                            valid_bound = (nmod % spec->modulus) == 0;
+                        }
+                    }
May be name the variables as prev_mod(ulus) and next_mod(ulus) for better
readability.
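
The two-sided rule that the hunk above checks (a new modulus must be a multiple of the next smaller existing modulus and a factor of the next larger one) can be sketched on its own as follows. This is an illustrative sketch only: it works on a plain sorted array of moduli rather than on the patch's datums array, and the function name is hypothetical. The variable names follow the prev_modulus/next_modulus suggestion.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-alone version of the check.  moduli[] holds the
 * moduli already in use, sorted in ascending order.  Returns true if
 * new_modulus keeps the invariant that every modulus is a factor of the
 * next larger one.
 */
bool
modulus_is_valid(const int *moduli, int nmoduli, int new_modulus)
{
    int offset = -1;    /* position of largest modulus <= new_modulus */
    int i;

    assert(new_modulus > 0);

    for (i = 0; i < nmoduli; i++)
    {
        if (moduli[i] <= new_modulus)
            offset = i;
    }

    if (offset < 0)
    {
        /* Smaller than every existing modulus: only a "next" to check. */
        return nmoduli == 0 || moduli[0] % new_modulus == 0;
    }
    else
    {
        int prev_modulus = moduli[offset];

        /* Must be a multiple of the next smaller (or equal) modulus ... */
        if (new_modulus % prev_modulus != 0)
            return false;

        /* ... and a factor of the next larger modulus, if there is one. */
        if (offset + 1 < nmoduli)
        {
            int next_modulus = moduli[offset + 1];

            if (next_modulus % new_modulus != 0)
                return false;
        }
    }

    return true;
}
```

With the modulus list (4, 36) discussed earlier in the thread, this sketch rejects 3, 9 and 23 and accepts 12 and 72. With (10, 50) it rejects 25, matching the regression test quoted from the v2 patch.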

+ *   for p_p1: satisfies_hash_partition(2, 1, hash_fn(a), hash_fn(b))
+ *   for p_p2: satisfies_hash_partition(4, 2, hash_fn(a), hash_fn(b))
+ *   for p_p3: satisfies_hash_partition(8, 0, hash_fn(a), hash_fn(b))
+ *   for p_p4: satisfies_hash_partition(8, 4, hash_fn(a), hash_fn(b))
The description here may be read as if we are calling the same hash function
for both a and b, but that's not true. So, you may want to clarify that
in hash_fn(a) hash_fn means hash function specified for key a.


+        if (key->partattrs[i] != 0)
+        {
+            keyCol = (Node *) makeVar(1,
+                                      key->partattrs[i],
+                                      key->parttypid[i],
+                                      key->parttypmod[i],
+                                      key->parttypcoll[i],
+                                      0);
+
+            /* Form hash_fn(value) expression */
+            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
+                                    get_fn_expr_rettype(&key->partsupfunc[i]),
+                                    list_make1(keyCol),
+                                    InvalidOid,
+                                    InvalidOid,
+                                    COERCE_EXPLICIT_CALL);
+        }
+        else
+        {
+            keyCol = (Node *) copyObject(lfirst(partexprs_item));
+            partexprs_item = lnext(partexprs_item);
+        }
I think we should add FuncExpr for column Vars as well as expressions.

The logic to compare two bounds is duplicated in qsort_partition_hbound_cmp()
and partition_bound_cmp(). Should we move it into a separate function
accepting moduli and remainders? That way, if we change it in future, we
have to change only one place.
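
One possible shape for such a shared helper is sketched below: a single comparison on (modulus, remainder) pairs, ordering by modulus and then by remainder, with a thin qsort() wrapper on top of it. The struct and all function names here are hypothetical and only illustrate the refactoring idea. They are not the patch's PartitionHashBound or its actual comparators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash partition bound. */
typedef struct HashBound
{
    int modulus;
    int remainder;
} HashBound;

/*
 * Single place that defines the ordering of hash bounds: by modulus first,
 * then by remainder.  Returns -1, 0 or 1.
 */
int
hash_bound_cmp(int modulus1, int remainder1, int modulus2, int remainder2)
{
    assert(modulus1 > 0 && modulus2 > 0);

    if (modulus1 != modulus2)
        return (modulus1 < modulus2) ? -1 : 1;
    if (remainder1 != remainder2)
        return (remainder1 < remainder2) ? -1 : 1;
    return 0;
}

/* qsort() callback that only unpacks its arguments and delegates. */
int
qsort_hash_bound_cmp(const void *a, const void *b)
{
    const HashBound *b1 = (const HashBound *) a;
    const HashBound *b2 = (const HashBound *) b;

    return hash_bound_cmp(b1->modulus, b1->remainder,
                          b2->modulus, b2->remainder);
}

/* Sort an array of bounds in place using the shared comparison. */
void
sort_hash_bounds(HashBound *bounds, size_t nbounds)
{
    qsort(bounds, nbounds, sizeof(HashBound), qsort_hash_bound_cmp);
}
```

Any other caller that needs to compare a stored bound against a probe (modulus, remainder) pair could call hash_bound_cmp() directly, so the ordering rule would live in one function.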

I think we need more comments for compute_hash_value(), mix_hash_value() and
satisfies_hash_partition() as to what each of them accepts and what it
computes.

+        /* key's hash values start from third argument of function. */
+        if (!PG_ARGISNULL(i + 2))
+        {
+            values[i] = PG_GETARG_DATUM(i + 2);
+            isnull[i] = false;
+        }
+        else
+            isnull[i] = true;
You could write this as
isnull[i] = PG_ARGISNULL(i + 2);
if (!isnull[i])
    values[i] = PG_GETARG_DATUM(i + 2);


+         * Identify a btree or hash opclass to use. Currently, we use only
+         * btree operators, which seems enough for list and range partitioning,
+         * and hash operators for hash partitioning.

The wording, if not read carefully, might be read as "we use only btree
operators".  I suggest we rephrase it as "Identify opclass to use. For
list and range partitioning we use only btree operators, which seems enough
for those. For hash partitioning, we use hash operators." for clarity.

+                    foreach (lc, $5)
+                    {
+                        DefElem    *opt = (DefElem *) lfirst(lc);
A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
Usually they are handled at the transformation stage. Why is this an exception?
If you do that, we can have all the error handling in
transformPartitionBound().

+DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
_null_ satisfies_hash_partition _null_ _null_ _null_ ));
Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
INT4)?

I have yet to review the test cases and go through all the places using
PARTITION_STRATEGY_RANGE/LIST to make sure that we are handling
PARTITION_STRATEGY_HASH in all those places.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

0002_additional_changes.patch

Re: [HACKERS] [POC] hash partitioning

From

Dilip Kumar

Date:

13 May 2017, 09:41:09

On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
> Hi,
>
> Please find the following updated patches attached:

I have done some testing with the new patch. Most of the cases worked
as expected, except the one below.

I expect the planner to select only "Seq Scan on t1" whereas it's
scanning both the partitions?

create table t (a int, b varchar) partition by hash(a);
create table t1 partition of t for values with (modulus 8, remainder 0);
create table t2 partition of t for values with (modulus 8, remainder 1);

postgres=# explain select * from t where a=8;
                        QUERY PLAN
----------------------------------------------------------
 Append  (cost=0.00..51.75 rows=12 width=36)
   ->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
         Filter: (a = 8)
   ->  Seq Scan on t2  (cost=0.00..25.88 rows=6 width=36)
         Filter: (a = 8)
(5 rows)


Some cosmetic comments.
-----------------------------------
+ RangeVar   *rv = makeRangeVarFromNameList(castNode(List, nameEl->arg));
+

Useless Hunk.

 /*
- * Build a CREATE SEQUENCE command to create the sequence object, and
- * add it to the list of things to be done before this CREATE/ALTER
- * TABLE.
+ * Build a CREATE SEQUENCE command to create the sequence object, and add
+ * it to the list of things to be done before this CREATE/ALTER TABLE.
  */

It seems that, in src/backend/parser/parse_utilcmd.c, you have changed
existing code with pgindent.  I think it's not a good idea to mix pgindent
changes with your patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

14 May 2017, 10:00:58

On Fri, May 12, 2017 at 10:39 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
>> Hi,
>>
>> Please find the following updated patches attached:
>>
>> 0001-Cleanup.patch : Does some cleanup and code refactoring required
>> for hash partition patch. Otherwise, there will be unnecessary diff in
>> 0002 patch
>
> Thanks for splitting the patch.
>
> +                if (isnull[0])
> +                    cur_index = partdesc->boundinfo->null_index;
> This code assumes that null_index will be set to -1 when has_null is false. Per
> RelationBuildPartitionDesc() this is true. But probably we should write this
> code as
> if (isnull[0])
> {
>     if (partdesc->boundinfo->has_null)
>         cur_index = partdesc->boundinfo->null_index;
> }
> That way we are certain that when has_null is false, cur_index = -1 similar to
> the original code.
>
Okay, will add this.  I still don't understand the point of having the
has_null variable: if no null-accepting partition exists, null_index is
always set to -1 in RelationBuildPartitionDesc.  Anyway, let's not change
the original code.

> Additional argument to ComputePartitionAttrs() isn't used anywhere in this
> patch, so may be this better be part of 0002. If we do this the only change
> that will remain in patch is the refactoring of RelationBuildPartitionDesc(),
> so we may consider merging it into 0002, unless we find that some more
> refactoring is needed. But for now, having it as a separate patch helps.
>
Okay.

> Here's some more comments on 0002
>
> + * In the case of hash partitioning, datums is a 2-D array, stores modulus and
> + * remainder values at datums[x][0] and datums[x][1] respectively for each
> + * partition in the ascending order.
>
> This comment about datums should appear in a paragraph of itself and may be
> rephrased as in the attached patch. May be we could also add a line about
> ndatums for hash partitioned tables as in the attached patch.
>
Thanks, looks good to me; will include this.

[...]
>
> +        if (key->strategy == PARTITION_STRATEGY_HASH)
> +        {
> +            ndatums = nparts;
> +            hbounds = (PartitionHashBound **) palloc(nparts *
> +
> sizeof(PartitionHashBound *));
> +            i = 0;
> +            foreach (cell, boundspecs)
> +            {
> +                PartitionBoundSpec *spec = lfirst(cell);
> +
> [ clipped ]
> +                hbounds[i]->index = i;
> +                i++;
> +            }
> For list and range partitioned table we order the bounds so that two
> partitioned tables have them in the same order irrespective of order in which
> they are specified by the user or hence stored in the catalogs. The partitions
> then get indexes according the order in which their bounds appear in ordered
> arrays of bounds. Thus any two partitioned tables with same partition
> specification always have same PartitionBoundInfoData. This helps in
> partition-wise join to match partition bounds of two given tables.  Above code
> assigns the indexes to the partitions as they appear in the catalogs. This
> means that two partitioned tables with same partition specification but
> different order for partition bound specification will have different
> PartitionBoundInfoData representation.
>
> If we do that, probably partition_bounds_equal() would reduce to just matching
> indexes and the last element of datums array i.e. the greatest modulus datum.
> If ordered datums array of two partitioned table do not match exactly, the
> mismatch can be because missing datums or different datums. If it's a missing
> datum it will change the greatest modulus or have corresponding entry in
> indexes array as -1. If the entry differs it will cause mismatching indexes in
> the index arrays.
>
Makes sense, will fix this.
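
A rough sketch of the canonical ordering discussed above, assuming bounds sort by modulus and then remainder. HashBound and canonicalize_hash_bounds are hypothetical stand-ins for illustration, not the patch's PartitionHashBound code. Each bound remembers its position in the catalog, and the canonical index is simply its position after sorting.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hash bound as read from the catalog. */
typedef struct HashBound
{
    int modulus;
    int remainder;
    int catalog_pos;    /* position in catalog (user-specified) order */
} HashBound;

/* qsort comparator: modulus first, then remainder. */
static int
hash_bound_qsort_cmp(const void *a, const void *b)
{
    const HashBound *x = (const HashBound *) a;
    const HashBound *y = (const HashBound *) b;

    if (x->modulus != y->modulus)
        return (x->modulus < y->modulus) ? -1 : 1;
    if (x->remainder != y->remainder)
        return (x->remainder < y->remainder) ? -1 : 1;
    return 0;
}

/*
 * Sort the bounds in place and fill mapping[] so that
 * mapping[catalog_pos] gives the canonical index of that partition.
 */
static void
canonicalize_hash_bounds(HashBound *bounds, int nparts, int *mapping)
{
    assert(nparts >= 0);
    qsort(bounds, (size_t) nparts, sizeof(HashBound), hash_bound_qsort_cmp);
    for (int i = 0; i < nparts; i++)
        mapping[bounds[i].catalog_pos] = i;
}
```

Two tables whose partitions were created in different orders then end up with the same sorted (modulus, remainder) array, which is what bound matching for partition-wise join relies on.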

[...]
>
> +                    if (offset < 0)
> +                    {
> +                        nmod = DatumGetInt32(datums[0][0]);
> +                        valid_bound = (nmod % spec->modulus) == 0;
> +                    }
> +                    else
> +                    {
> +                        pmod = DatumGetInt32(datums[offset][0]);
> +                        valid_bound = (spec->modulus % pmod) == 0;
> +
> +                        if (valid_bound && (offset + 1) < ndatums)
> +                        {
> +                            nmod = DatumGetInt32(datums[offset + 1][0]);
> +                            valid_bound = (nmod % spec->modulus) == 0;
> +                        }
> +                    }
> May be name the variables as prev_mod(ulus) and next_mod(ulus) for better
> readability.
>
Okay, will rename to prev_modulus and next_modulus resp.
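
The check in the hunk above can be sketched in isolation as follows. This is a simplified illustration: it assumes the existing moduli are kept in an ascending array and looks up the offset by modulus only, whereas the patch locates the offset from the full bound. The function name modulus_is_valid is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Check a new modulus against existing moduli sorted in ascending order.
 * The new modulus must be a multiple of the previous (next smaller or
 * equal) modulus and a factor of the next larger one.
 */
static bool
modulus_is_valid(const int *moduli, int ndatums, int new_modulus)
{
    int offset = -1;

    assert(new_modulus > 0);

    if (ndatums == 0)
        return true;            /* nothing to conflict with */

    /* Find the last existing modulus that is <= the new one. */
    for (int i = 0; i < ndatums; i++)
    {
        if (moduli[i] <= new_modulus)
            offset = i;
    }

    if (offset < 0)
    {
        /* New modulus is smaller than all others: check the next one. */
        int next_modulus = moduli[0];

        return (next_modulus % new_modulus) == 0;
    }
    else
    {
        int prev_modulus = moduli[offset];

        if ((new_modulus % prev_modulus) != 0)
            return false;

        if (offset + 1 < ndatums)
        {
            int next_modulus = moduli[offset + 1];

            return (next_modulus % new_modulus) == 0;
        }
        return true;
    }
}
```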

> + *   for p_p1: satisfies_hash_partition(2, 1, hash_fn(a), hash_fn(b))
> + *   for p_p2: satisfies_hash_partition(4, 2, hash_fn(a), hash_fn(b))
> + *   for p_p3: satisfies_hash_partition(8, 0, hash_fn(a), hash_fn(b))
> + *   for p_p4: satisfies_hash_partition(8, 4, hash_fn(a), hash_fn(b))
> The description here may be read as if we are calling the same hash function
> for both a and b, but that's not true. So, you may want to clarify that
> in hash_fn(a) hash_fn means hash function specified for key a.
>
Okay.
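
To illustrate the four constraints in that comment, here is a small sketch that treats the combined hash of the key columns as a single unsigned value. How the patch combines hash_fn(a) and hash_fn(b) is not shown here, and the helper names are hypothetical. It counts how many of the bounds (2,1), (4,2), (8,0) and (8,4) accept a given hash value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A row belongs to a partition when hash % modulus == remainder. */
static bool
hash_satisfies_bound(uint32_t modulus, uint32_t remainder, uint32_t hash)
{
    assert(modulus > 0 && remainder < modulus);
    return (hash % modulus) == remainder;
}

/* Count the partitions p_p1..p_p4 from the comment that accept this hash. */
static int
count_matching_partitions(uint32_t hash)
{
    static const uint32_t bounds[4][2] = {
        {2, 1},                 /* p_p1 */
        {4, 2},                 /* p_p2 */
        {8, 0},                 /* p_p3 */
        {8, 4}                  /* p_p4 */
    };
    int nmatches = 0;

    for (int i = 0; i < 4; i++)
    {
        if (hash_satisfies_bound(bounds[i][0], bounds[i][1], hash))
            nmatches++;
    }
    return nmatches;
}
```

For these four bounds every hash value lands in exactly one partition, since the remainders modulo 8 split as {1,3,5,7}, {2,6}, {0} and {4}.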

>
> +        if (key->partattrs[i] != 0)
> +        {
> +            keyCol = (Node *) makeVar(1,
> +                                      key->partattrs[i],
> +                                      key->parttypid[i],
> +                                      key->parttypmod[i],
> +                                      key->parttypcoll[i],
> +                                      0);
> +
> +            /* Form hash_fn(value) expression */
> +            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
> +                                    get_fn_expr_rettype(&key->partsupfunc[i]),
> +                                    list_make1(keyCol),
> +                                    InvalidOid,
> +                                    InvalidOid,
> +                                    COERCE_EXPLICIT_CALL);
> +        }
> +        else
> +        {
> +            keyCol = (Node *) copyObject(lfirst(partexprs_item));
> +            partexprs_item = lnext(partexprs_item);
> +        }
> I think we should add FuncExpr for column Vars as well as expressions.
>
Okay, will fix this.

> The logic to compare two bounds is duplicated in qsort_partition_hbound_cmp()
> and partition_bound_cmp(). Should we separate it into a separate function
> accepting moduli and remainders. That way in case we change it in future, we
> have to change only one place.
>
Okay.

> I think we need more comments for compute_hash_value(), mix_hash_value() and
> satisfies_hash_partition() as to what each of them accepts and what it
> computes.
>
> +        /* key's hash values start from third argument of function. */
> +        if (!PG_ARGISNULL(i + 2))
> +        {
> +            values[i] = PG_GETARG_DATUM(i + 2);
> +            isnull[i] = false;
> +        }
> +        else
> +            isnull[i] = true;
> You could write this as
> isnull[i] = PG_ARGISNULL(i + 2);
> if (!isnull[i])
>     values[i] = PG_GETARG_DATUM(i + 2);
>
Okay.
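
For reference, a standalone sketch of the shortened form, in which the datum is fetched only when the argument is not null. FakeCallInfo is a hypothetical stand-in for the function-call arguments so that the snippet compiles outside the backend; it is not the fmgr API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the call arguments; not the real fmgr API. */
typedef struct FakeCallInfo
{
    const bool *argnull;        /* true where the argument is NULL */
    const long *arg;            /* argument values */
} FakeCallInfo;

/*
 * Copy the per-key hash values out of the argument list.  The key hash
 * values start from the third argument, hence the offset of 2.
 */
static void
collect_key_hashes(const FakeCallInfo *fcinfo, int nkeys,
                   long *values, bool *isnull)
{
    assert(nkeys >= 0);

    for (int i = 0; i < nkeys; i++)
    {
        isnull[i] = fcinfo->argnull[i + 2];
        if (!isnull[i])
            values[i] = fcinfo->arg[i + 2];     /* only read non-null args */
    }
}
```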

>
> +         * Identify a btree or hash opclass to use. Currently, we use only
> +         * btree operators, which seems enough for list and range partitioning,
> +         * and hash operators for hash partitioning.
>
> The wording, if not read carefully, might be read as "we use only btree
> operators".  I suggest we rephrase it as "Identify opclass to use. For
> list and range
> partitioning we use only btree operators, which seems enough for those. For
> hash partitioning, we use hash operators." for clarity.
>
Okay

> +                    foreach (lc, $5)
> +                    {
> +                        DefElem    *opt = (DefElem *) lfirst(lc);
> A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
> Usually they are handled at the transformation stage. Why is this an exception?
> If you do that, we can have all the error handling in
> transformPartitionBound().
>
If so, ForValues would need to return a list for hash and a
PartitionBoundSpec for the other two; wouldn't that break code consistency?
Also, such validation is not new in gram.y; see xmltable_column_el.

> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
> Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
> INT4)?
>
Will use INT4ARRAY in the next patch, but I am a little sceptical of it.  We
need an unsigned int32, but unfortunately there is no variadic uint32
support.  How about INT8ARRAY?

Regards,
Amul

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

14 May 2017, 11:00:24

On Sat, May 13, 2017 at 12:11 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
>> Hi,
>>
>> Please find the following updated patches attached:
>
> I have done some testing with the new patch, most of the cases worked
> as per the expectation except below
>
> I expect the planner to select only "Seq Scan on t1" whereas it's
> scanning both the partitions?
>
> create table t (a int, b varchar) partition by hash(a);
> create table t1 partition of t for values with (modulus 8, remainder 0);
> create table t2 partition of t for values with (modulus 8, remainder 1);
>
> postgres=# explain select * from t where a=8;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Append  (cost=0.00..51.75 rows=12 width=36)
>    ->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
>          Filter: (a = 8)
>    ->  Seq Scan on t2  (cost=0.00..25.88 rows=6 width=36)
>          Filter: (a = 8)
> (5 rows)
>
You are correct.  As of now, constraint exclusion doesn't work when the
partition constraint involves a function call [1], and the hash partition
constraint does contain a satisfies_hash_partition() function call.

>
> Some cosmetic comments.
> -----------------------------------
> + RangeVar   *rv = makeRangeVarFromNameList(castNode(List, nameEl->arg));
> +
>
> Useless Hunk.
>
>  /*
> - * Build a CREATE SEQUENCE command to create the sequence object, and
> - * add it to the list of things to be done before this CREATE/ALTER
> - * TABLE.
> + * Build a CREATE SEQUENCE command to create the sequence object, and add
> + * it to the list of things to be done before this CREATE/ALTER TABLE.
>   */
>
> Seems like, in src/backend/parser/parse_utilcmd.c, you have changed
> the existing code with
> pgindent.  I think it's not a good idea to mix pgindent changes with your patch.
>
Oops, my silly mistake, sorry about that. Fixed in attached version.

Regards,
Amul

[1] https://www.postgresql.org/message-id/CA%2BTgmoaE9NZ_RiqZQLp2aJXPO4E78QxkQYL-FR2zCDop96Ahdg%40mail.gmail.com

Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

15 May 2017, 13:57:13

On Wed, May 10, 2017 at 10:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 10, 2017 at 8:34 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> Hash partitioning will partition the data based on the hash value of the
>> partition key. Does that require collation? Should we throw an error/warning if
>> collation is specified in PARTITION BY clause?
>
> Collation is only relevant for ordering, not equality.  Since hash
> opclasses provide only equality, not ordering, it's not relevant here.
> I'm not sure whether we should error out if it's specified or just
> silently ignore it.  Maybe an ERROR is a good idea?  But not sure.
>
IMHO, we could simply emit a WARNING and ignore the collation.  Thoughts?

Updated patches attached.

Regards,
Amul

Hi,
Here's patch with some cosmetic fixes to 0002, to be applied on top of 0002.

On Tue, May 16, 2017 at 1:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Sun, May 14, 2017 at 12:30 PM, amul sul <sulamul@gmail.com> wrote:
>> On Fri, May 12, 2017 at 10:39 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> On Fri, May 12, 2017 at 6:08 PM, amul sul <sulamul@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Please find the following updated patches attached:
>>>>
>>>> 0001-Cleanup.patch : Does some cleanup and code refactoring required
>>>> for hash partition patch. Otherwise, there will be unnecessary diff in
>>>> 0002 patch
>>>
>>> Thanks for splitting the patch.
>>>
>>> +                if (isnull[0])
>>> +                    cur_index = partdesc->boundinfo->null_index;
>>> This code assumes that null_index will be set to -1 when has_null is false. Per
>>> RelationBuildPartitionDesc() this is true. But probably we should write this
>>> code as
>>> if (isnull[0])
>>> {
>>>     if (partdesc->boundinfo->has_null)
>>>         cur_index = partdesc->boundinfo->null_index;
>>> }
>>> That way we are certain that when has_null is false, cur_index = -1 similar to
>>> the original code.
>>>
>> Okay will add this.
>
> Thanks.
>
>> I still don't understood point of having has_null
>> variable, if no null accepting partition exists then null_index is
>> alway set to -1 in RelationBuildPartitionDesc.  Anyway, let not change
>> the original code.
>
> I agree. has_null might have been folded as null_index == -1. But
> that's not the problem of this patch.
>
> 0001 looks good to me now.
>
>
>>
>> [...]
>>>
>>> +        if (key->strategy == PARTITION_STRATEGY_HASH)
>>> +        {
>>> +            ndatums = nparts;
>>> +            hbounds = (PartitionHashBound **) palloc(nparts *
>>> +
>>> sizeof(PartitionHashBound *));
>>> +            i = 0;
>>> +            foreach (cell, boundspecs)
>>> +            {
>>> +                PartitionBoundSpec *spec = lfirst(cell);
>>> +
>>> [ clipped ]
>>> +                hbounds[i]->index = i;
>>> +                i++;
>>> +            }
>>> For list and range partitioned table we order the bounds so that two
>>> partitioned tables have them in the same order irrespective of order in which
>>> they are specified by the user or hence stored in the catalogs. The partitions
>>> then get indexes according the order in which their bounds appear in ordered
>>> arrays of bounds. Thus any two partitioned tables with same partition
>>> specification always have same PartitionBoundInfoData. This helps in
>>> partition-wise join to match partition bounds of two given tables.  Above code
>>> assigns the indexes to the partitions as they appear in the catalogs. This
>>> means that two partitioned tables with same partition specification but
>>> different order for partition bound specification will have different
>>> PartitionBoundInfoData representation.
>>>
>>> If we do that, probably partition_bounds_equal() would reduce to just matching
>>> indexes and the last element of datums array i.e. the greatest modulus datum.
>>> If ordered datums array of two partitioned table do not match exactly, the
>>> mismatch can be because missing datums or different datums. If it's a missing
>>> datum it will change the greatest modulus or have corresponding entry in
>>> indexes array as -1. If the entry differs it will cause mismatching indexes in
>>> the index arrays.
>>>
>> Make sense, will fix this.
>
> I don't see this being addressed in the patches attached in the reply to Dilip.
>
>>
>>>
>>> +        if (key->partattrs[i] != 0)
>>> +        {
>>> +            keyCol = (Node *) makeVar(1,
>>> +                                      key->partattrs[i],
>>> +                                      key->parttypid[i],
>>> +                                      key->parttypmod[i],
>>> +                                      key->parttypcoll[i],
>>> +                                      0);
>>> +
>>> +            /* Form hash_fn(value) expression */
>>> +            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
>>> +                                    get_fn_expr_rettype(&key->partsupfunc[i]),
>>> +                                    list_make1(keyCol),
>>> +                                    InvalidOid,
>>> +                                    InvalidOid,
>>> +                                    COERCE_EXPLICIT_CALL);
>>> +        }
>>> +        else
>>> +        {
>>> +            keyCol = (Node *) copyObject(lfirst(partexprs_item));
>>> +            partexprs_item = lnext(partexprs_item);
>>> +        }
>>> I think we should add FuncExpr for column Vars as well as expressions.
>>>
>> Okay, will fix this.
>
> Here, please add a check similar to get_quals_for_range()
> 1840             if (partexprs_item == NULL)
> 1841                 elog(ERROR, "wrong number of partition key expressions");
>
>
>>
>>> I think we need more comments for compute_hash_value(), mix_hash_value() and
>>> satisfies_hash_partition() as to what each of them accepts and what it
>>> computes.
>>>
>>> +        /* key's hash values start from third argument of function. */
>>> +        if (!PG_ARGISNULL(i + 2))
>>> +        {
>>> +            values[i] = PG_GETARG_DATUM(i + 2);
>>> +            isnull[i] = false;
>>> +        }
>>> +        else
>>> +            isnull[i] = true;
>>> You could write this as
>>> isnull[i] = PG_ARGISNULL(i + 2);
>>> if (!isnull[i])
>>>     values[i] = PG_GETARG_DATUM(i + 2);
>>>
>> Okay.
>
> If we have used this technique somewhere else in PG code, please
> mention that function/place.
>         /*
>          * Rotate hash left 1 bit before mixing in the next column.  This
>          * prevents equal values in different keys from cancelling each other.
>          */
>
>
>>
>>> +                    foreach (lc, $5)
>>> +                    {
>>> +                        DefElem    *opt = (DefElem *) lfirst(lc);
>>> A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
>>> Usually they are handled at the transformation stage. Why is this an exception?
>>> If you do that, we can have all the error handling in
>>> transformPartitionBound().
>>>
>> If so, ForValues need to return list for hash and PartitionBoundSpec
>> for other two; wouldn't  that break code consistency? And such
>> validation is not new in gram.y see xmltable_column_el.
>
> Thanks for pointing that out. Ok, then may be leave it in gram.y. But
> may be we should move the error handling in transform function.
>
>
>>
>>> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
>>> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
>>> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
>>> Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
>>> INT4)?
>>>
>> Will use INT4ARRAY in next patch, but I am little sceptical of it.  we
>> need an unsigned int32, but unfortunately there is not variadic uint32
>> support.  How about INT8ARRAY?
>
> Hmm, I think as long as the binary representation of given unsigned
> integer doesn't change in the function call, we could cast an INT32
> datums into unsigned int32, so spending extra 4 bytes per partition
> key doesn't look like worth the effort.
>
> A related question is, all hash functions have return type as
> "integer" but internally they return uint32. Why not to do the same
> for this function as well?
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

0002-cosmetic_fixes.patch

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

16 May 2017, 13:00:00

On Tue, May 16, 2017 at 1:02 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
 [...]
>>>
>>> +        if (key->strategy == PARTITION_STRATEGY_HASH)
>>> +        {
>>> +            ndatums = nparts;
>>> +            hbounds = (PartitionHashBound **) palloc(nparts *
>>> +
>>> sizeof(PartitionHashBound *));
>>> +            i = 0;
>>> +            foreach (cell, boundspecs)
>>> +            {
>>> +                PartitionBoundSpec *spec = lfirst(cell);
>>> +
>>> [ clipped ]
>>> +                hbounds[i]->index = i;
>>> +                i++;
>>> +            }
>>> For list and range partitioned table we order the bounds so that two
>>> partitioned tables have them in the same order irrespective of order in which
>>> they are specified by the user or hence stored in the catalogs. The partitions
>>> then get indexes according the order in which their bounds appear in ordered
>>> arrays of bounds. Thus any two partitioned tables with same partition
>>> specification always have same PartitionBoundInfoData. This helps in
>>> partition-wise join to match partition bounds of two given tables.  Above code
>>> assigns the indexes to the partitions as they appear in the catalogs. This
>>> means that two partitioned tables with same partition specification but
>>> different order for partition bound specification will have different
>>> PartitionBoundInfoData representation.
>>>
>>> If we do that, probably partition_bounds_equal() would reduce to just matching
>>> indexes and the last element of datums array i.e. the greatest modulus datum.
>>> If ordered datums array of two partitioned table do not match exactly, the
>>> mismatch can be because missing datums or different datums. If it's a missing
>>> datum it will change the greatest modulus or have corresponding entry in
>>> indexes array as -1. If the entry differs it will cause mismatching indexes in
>>> the index arrays.
>>>
>> Make sense, will fix this.
>
> I don't see this being addressed in the patches attached in the reply to Dilip.
>

Fixed in the attached version.

>>
>>>
>>> +        if (key->partattrs[i] != 0)
>>> +        {
>>> +            keyCol = (Node *) makeVar(1,
>>> +                                      key->partattrs[i],
>>> +                                      key->parttypid[i],
>>> +                                      key->parttypmod[i],
>>> +                                      key->parttypcoll[i],
>>> +                                      0);
>>> +
>>> +            /* Form hash_fn(value) expression */
>>> +            keyCol = (Node *) makeFuncExpr(key->partsupfunc[i].fn_oid,
>>> +                                    get_fn_expr_rettype(&key->partsupfunc[i]),
>>> +                                    list_make1(keyCol),
>>> +                                    InvalidOid,
>>> +                                    InvalidOid,
>>> +                                    COERCE_EXPLICIT_CALL);
>>> +        }
>>> +        else
>>> +        {
>>> +            keyCol = (Node *) copyObject(lfirst(partexprs_item));
>>> +            partexprs_item = lnext(partexprs_item);
>>> +        }
>>> I think we should add FuncExpr for column Vars as well as expressions.
>>>
>> Okay, will fix this.
>
> Here, please add a check similar to get_quals_for_range()
> 1840             if (partexprs_item == NULL)
> 1841                 elog(ERROR, "wrong number of partition key expressions");
>
>

Fixed in the attached version.

>>
>>> I think we need more comments for compute_hash_value(), mix_hash_value() and
>>> satisfies_hash_partition() as to what each of them accepts and what it
>>> computes.
>>>
>>> +        /* key's hash values start from third argument of function. */
>>> +        if (!PG_ARGISNULL(i + 2))
>>> +        {
>>> +            values[i] = PG_GETARG_DATUM(i + 2);
>>> +            isnull[i] = false;
>>> +        }
>>> +        else
>>> +            isnull[i] = true;
>>> You could write this as
>>> isnull[i] = PG_ARGISNULL(i + 2);
>>> if (!isnull[i])
>>>     values[i] = PG_GETARG_DATUM(i + 2);
>>>
>> Okay.
>
> If we have used this technique somewhere else in PG code, please
> mention that function/place.
>         /*
>          * Rotate hash left 1 bit before mixing in the next column.  This
>          * prevents equal values in different keys from cancelling each other.
>          */
>

Fixed in the attached version.
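
A small sketch of the rotate-then-XOR mixing that the quoted comment describes, written as a standalone function. The name mix_key_hashes is hypothetical, and NULL handling is left out on purpose; the patch's own mix_hash_value() may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine per-column hash values into one row hash.  Before each column
 * is mixed in, the running hash is rotated left by one bit, so that equal
 * hash values in different key columns do not cancel each other out the
 * way they would with a plain XOR.
 */
static uint32_t
mix_key_hashes(const uint32_t *hashes, int nkeys)
{
    uint32_t rowhash = 0;

    assert(nkeys >= 0);

    for (int i = 0; i < nkeys; i++)
    {
        /* rotate left 1 bit */
        rowhash = (rowhash << 1) | (rowhash >> 31);
        /* mix in this column's hash */
        rowhash ^= hashes[i];
    }
    return rowhash;
}
```

With plain XOR, two key columns that both hash to 0x12345678 would combine to 0; with the rotation they combine to 0x365CFA88.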

>
>>
>>> +                    foreach (lc, $5)
>>> +                    {
>>> +                        DefElem    *opt = (DefElem *) lfirst(lc);
>>> A search on WITH in gram.y shows that we do not handle WITH options in gram.y.
>>> Usually they are handled at the transformation stage. Why is this an exception?
>>> If you do that, we can have all the error handling in
>>> transformPartitionBound().
>>>
>> If so, ForValues need to return list for hash and PartitionBoundSpec
>> for other two; wouldn't  that break code consistency? And such
>> validation is not new in gram.y see xmltable_column_el.
>
> Thanks for pointing that out. Ok, then may be leave it in gram.y. But
> may be we should move the error handling in transform function.
>

IMO, let it stay there for readability.  It makes it easier to understand
why we set -1 for modulus and remainder.

>
>>
>>> +DATA(insert OID = 5028 ( satisfies_hash_partition PGNSP PGUID 12 1 0
>>> 2276 0 f f f f f f i s 3 0 16 "23 23 2276" _null_ _null_ _null_ _null_
>>> _null_ satisfies_hash_partition _null_ _null_ _null_ ));
>>> Why is third argument to this function ANY? Shouldn't it be INT4ARRAY (variadic
>>> INT4)?
>>>
>> Will use INT4ARRAY in next patch, but I am little sceptical of it.  we
>> need an unsigned int32, but unfortunately there is not variadic uint32
>> support.  How about INT8ARRAY?
>
> Hmm, I think as long as the binary representation of given unsigned
> integer doesn't change in the function call, we could cast an INT32
> datums into unsigned int32, so spending extra 4 bytes per partition
> key doesn't look like worth the effort.
>
> A related question is, all hash functions have return type as
> "integer" but internally they return uint32. Why not to do the same
> for this function as well?

I see. IIUC, there is no harm in using INT4ARRAY; thanks for the explanation.
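
The point about the binary representation can be sketched as below: a 32-bit hash passed through a signed int32 and converted back keeps all of its bits. The helper names are hypothetical and this is plain C, not the Datum conversion macros.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(int32_t) == sizeof(uint32_t),
              "signed and unsigned 32-bit types must have the same size");

/* Store an unsigned hash in a signed 32-bit slot without changing bits. */
static int32_t
hash_as_int32(uint32_t hash)
{
    int32_t out;

    memcpy(&out, &hash, sizeof(out));   /* bit-for-bit copy */
    return out;
}

/* Recover the unsigned hash; signed-to-unsigned conversion is modulo 2^32. */
static uint32_t
int32_as_hash(int32_t value)
{
    return (uint32_t) value;
}
```

Since the round trip is lossless, an int4 argument is enough to carry a uint32 hash value, and widening to int8 buys nothing.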

Regards,
Amul

Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

16 May 2017, 13:09:53

On Tue, May 16, 2017 at 1:17 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> Hi,
> Here's patch with some cosmetic fixes to 0002, to be applied on top of 0002.
>

Thank you, included in v6 patch.

Regards,
Amul

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

16 May 2017, 13:52:07

On Tue, May 16, 2017 at 3:30 PM, amul sul <sulamul@gmail.com> wrote:
> On Tue, May 16, 2017 at 1:02 PM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>  [...]
>>>>
>>>> +        if (key->strategy == PARTITION_STRATEGY_HASH)
>>>> +        {
>>>> +            ndatums = nparts;
>>>> +            hbounds = (PartitionHashBound **) palloc(nparts *
>>>> +
>>>> sizeof(PartitionHashBound *));
>>>> +            i = 0;
>>>> +            foreach (cell, boundspecs)
>>>> +            {
>>>> +                PartitionBoundSpec *spec = lfirst(cell);
>>>> +
>>>> [ clipped ]
>>>> +                hbounds[i]->index = i;
>>>> +                i++;
>>>> +            }
>>>> For list and range partitioned table we order the bounds so that two
>>>> partitioned tables have them in the same order irrespective of order in which
>>>> they are specified by the user or hence stored in the catalogs. The partitions
>>>> then get indexes according the order in which their bounds appear in ordered
>>>> arrays of bounds. Thus any two partitioned tables with same partition
>>>> specification always have same PartitionBoundInfoData. This helps in
>>>> partition-wise join to match partition bounds of two given tables.  Above code
>>>> assigns the indexes to the partitions as they appear in the catalogs. This
>>>> means that two partitioned tables with same partition specification but
>>>> different order for partition bound specification will have different
>>>> PartitionBoundInfoData representation.
>>>>
>>>> If we do that, probably partition_bounds_equal() would reduce to just matching
>>>> indexes and the last element of datums array i.e. the greatest modulus datum.
>>>> If ordered datums array of two partitioned table do not match exactly, the
>>>> mismatch can be because missing datums or different datums. If it's a missing
>>>> datum it will change the greatest modulus or have corresponding entry in
>>>> indexes array as -1. If the entry differs it will cause mismatching indexes in
>>>> the index arrays.
>>>>
>>> Make sense, will fix this.
>>
>> I don't see this being addressed in the patches attached in the reply to Dilip.
>>
>
> Fixed in the attached version.
>

The v6 patch has a bug in partition OID mapping and indexing; it is fixed in
the attached version.

Partition OIDs will now be arranged in ascending order of hash partition
bound (i.e. modulus and remainder sort order).

Regards,
Amul

On Tue, May 16, 2017 at 10:00 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, May 16, 2017 at 4:22 PM, amul sul <sulamul@gmail.com> wrote:
>> v6 patch has bug in partition oid mapping and indexing, fixed in the
>> attached version.
>>
>> Now partition oids will be arranged in the ascending order of hash
>> partition bound  (i.e. modulus and remainder sorting order)
>
> Thanks for the update patch. I have some more comments.
>
> ------------
> + if (spec->remainder < 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> +  errmsg("hash partition remainder must be less than modulus")));
>
> I think this error message is not correct, you might want to change it
> to "hash partition remainder must be non-negative integer"
>

Fixed in the attached version;  used "hash partition remainder must be
greater than or equal to 0" instead.

> -------
>
> +         The table is partitioned by specifying remainder and modulus for each
> +         partition. Each partition holds rows for which the hash value of
>
> Wouldn't it be better to say "modulus and remainder" instead of
> "remainder and modulus" then it will be consistent?
>

You are correct, fixed in the attached version.

> -------
> +       An <command>UPDATE</> that causes a row to move from one partition to
> +       another fails, because
>
> fails, because -> fails because
>

This hunk no longer exists in the attached patch; it was copied by
mistake, sorry about that.

> -------
>
> Wouldn't it be a good idea to document how to increase the number of
> hash partitions, I think we can document it somewhere with an example,
> something like Robert explained upthread?
>
> create table foo (a integer, b text) partition by hash (a);
> create table foo1 partition of foo with (modulus 2, remainder 0);
> create table foo2 partition of foo with (modulus 2, remainder 1);
>
> You can detach foo1, create two new partitions with modulus 4 and
> remainders 0 and 2, and move the data over from the old partition
>
> I think it will be good information for a user to have? or it's
> already documented and I missed it?
>

I think we should, but I am not sure about it.
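
For reference, the arithmetic behind the split quoted above can be
sketched as follows: every hash value that maps to (modulus 2, remainder 0)
maps to either (modulus 4, remainder 0) or (modulus 4, remainder 2). The
function names are hypothetical, and the check is a brute-force loop over
a range of hash values rather than anything the patch does.

```c
#include <stdint.h>

/* Remainder that selects a partition for a given hash value. */
static uint32_t
hash_remainder(uint32_t hash, uint32_t modulus)
{
    return hash % modulus;
}

/*
 * Return 1 if every hash value in [0, limit) that belongs to the old
 * partition (old_modulus, old_remainder) belongs to one of the two new
 * partitions (new_modulus, rem_a) or (new_modulus, rem_b); 0 otherwise.
 */
static int
split_covers_old_partition(uint32_t old_modulus, uint32_t old_remainder,
                           uint32_t new_modulus, uint32_t rem_a,
                           uint32_t rem_b, uint32_t limit)
{
    for (uint32_t h = 0; h < limit; h++)
    {
        uint32_t    r;

        if (hash_remainder(h, old_modulus) != old_remainder)
            continue;           /* row was never in the old partition */
        r = hash_remainder(h, new_modulus);
        if (r != rem_a && r != rem_b)
            return 0;           /* row would have no home after the split */
    }
    return 1;
}
```

Rows in the untouched partition (modulus 2, remainder 1) are not affected,
which is why only foo1 has to be detached in the example.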

Regards,
Amul

On Wed, May 17, 2017 at 11:11 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Wed, May 17, 2017 at 12:04 AM, amul sul <sulamul@gmail.com> wrote:
>> On Tue, May 16, 2017 at 10:00 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Tue, May 16, 2017 at 4:22 PM, amul sul <sulamul@gmail.com> wrote:
>>>> v6 patch has bug in partition oid mapping and indexing, fixed in the
>>>> attached version.
>>>>
>>>> Now partition oids will be arranged in the ascending order of hash
>>>> partition bound  (i.e. modulus and remainder sorting order)
>>>
>>> Thanks for the update patch. I have some more comments.
>>>
>>> ------------
>>> + if (spec->remainder < 0)
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
>>> +  errmsg("hash partition remainder must be less than modulus")));
>>>
>>> I think this error message is not correct, you might want to change it
>>> to "hash partition remainder must be non-negative integer"
>>>
>>
>> Fixed in the attached version;  used "hash partition remainder must be
>> greater than or equal to 0" instead.
>
> I would suggest "non-zero positive", since that's what we are using in
> the documentation.
>

Understood, fixed in the attached version.

>>
>>> -------
>>>
>>> +         The table is partitioned by specifying remainder and modulus for each
>>> +         partition. Each partition holds rows for which the hash value of
>>>
>>> Wouldn't it be better to say "modulus and remainder" instead of
>>> "remainder and modulus" then it will be consistent?
>>>
>>
>> You are correct, fixed in the attached version.
>>
>>> -------
>>> +       An <command>UPDATE</> that causes a row to move from one partition to
>>> +       another fails, because
>>>
>>> fails, because -> fails because
>>>
>>
>> This hunk is no longer exists in the attached patch, that was mistaken
>> copied, sorry about that.
>>
>>> -------
>>>
>>> Wouldn't it be a good idea to document how to increase the number of
>>> hash partitions, I think we can document it somewhere with an example,
>>> something like Robert explained upthread?
>>>
>>> create table foo (a integer, b text) partition by hash (a);
>>> create table foo1 partition of foo with (modulus 2, remainder 0);
>>> create table foo2 partition of foo with (modulus 2, remainder 1);
>>>
>>> You can detach foo1, create two new partitions with modulus 4 and
>>> remainders 0 and 2, and move the data over from the old partition
>>>
>>> I think it will be good information for a user to have? or it's
>>> already documented and I missed it?
>>>
>
> This is already part of documentation contained in the patch.
>
> Here are some more comments
> @@ -3296,6 +3311,14 @@ ALTER TABLE measurement ATTACH PARTITION
> measurement_y2008m02
>         not the partitioned table.
>        </para>
>       </listitem>
> +
> +     <listitem>
> +      <para>
> +       An <command>UPDATE</> that causes a row to move from one partition to
> +       another fails, because the new value of the row fails to satisfy the
> +       implicit partition constraint of the original partition.
> +      </para>
> +     </listitem>
>      </itemizedlist>
>      </para>
>      </sect3>
> The description in this chunk is applicable to all the kinds of partitioning.
> Why should it be part of a patch implementing hash partitioning?
>

This was already addressed in the previous patch (v8).

> +        Declarative partitioning only supports hash, list and range
> +        partitioning, whereas table inheritance allows data to be
> +        divided in a manner of the user's choosing.  (Note, however,
> +        that if constraint exclusion is unable to prune partitions
> +        effectively, query performance will be very poor.)
> Looks like the line width is less than 80 characters.
>

Fixed in the attached version.

> In partition_bounds_equal(), please add comments explaining why is it safe to
> check just the indexes? May be we should add code under assertion to make sure
> that the datums are equal as well.

Added assert in the attached version.

> The comment could be something
> like, "If two partitioned tables have different greatest moduli, their
> partition schemes don't match. If they have same greatest moduli, and
> all remainders have different indexes, they all have same modulus
> specified and the partitions are ordered by remainders, thus indexes
> array will be an identity i.e. index[i] = i. If the partition
> corresponding to a given remainder exists, it will have same index
> entry for both partitioned tables or if it's missing it will be -1.
> Thus if indexes array matches, corresponding datums array matches. If
> there are multiple remainders corresponding to a given partition,
> their partitions are ordered by the lowest of the remainders, thus if
> indexes array matches, both of the tables have same indexes arrays, in
> both the tables remainders corresponding to multiple partitions all
> have same indexes and thus same modulus. Thus again if the indexes are
> same, datums are same.".
>

Thanks, added with minor modification.
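
A rough sketch of the idea in the suggested comment, under stated
assumptions: the bounds are already sorted by (modulus, remainder), the
indexes array has one entry per remainder of the greatest modulus, and a
missing partition is marked with -1. HashBound, HashBoundInfo, MAX_MODULUS
and both functions are hypothetical names for illustration, not the
patch's code.

```c
#define MAX_MODULUS 64          /* arbitrary cap for this sketch */

typedef struct HashBound
{
    int         modulus;
    int         remainder;
} HashBound;

typedef struct HashBoundInfo
{
    int         greatest_modulus;
    int         indexes[MAX_MODULUS];   /* remainder -> partition, or -1 */
} HashBoundInfo;

/*
 * bounds[] must be sorted by (modulus, remainder), so the last entry
 * carries the greatest modulus, which must not exceed MAX_MODULUS.
 */
static void
build_bound_info(const HashBound *bounds, int nparts, HashBoundInfo *info)
{
    info->greatest_modulus = bounds[nparts - 1].modulus;

    for (int r = 0; r < info->greatest_modulus; r++)
        info->indexes[r] = -1;

    /* A partition owns every remainder congruent to its own remainder. */
    for (int i = 0; i < nparts; i++)
        for (int r = bounds[i].remainder; r < info->greatest_modulus;
             r += bounds[i].modulus)
            info->indexes[r] = i;
}

/* Compare the greatest modulus and then the indexes arrays only. */
static int
bound_info_equal(const HashBoundInfo *a, const HashBoundInfo *b)
{
    if (a->greatest_modulus != b->greatest_modulus)
        return 0;
    for (int r = 0; r < a->greatest_modulus; r++)
        if (a->indexes[r] != b->indexes[r])
            return 0;
    return 1;
}
```

In this sketch a missing partition shows up as a -1 entry, so two tables
that differ by one partition compare as unequal without looking at the
datums at all.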

> In the same function
>     if (key->strategy == PARTITION_STRATEGY_HASH)
>     {
>         int            greatest_modulus;
>
>         /*
>          * Compare greatest modulus of hash partition bound which
>          * is the last element of datums array.
>          */
>         if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
>             return false;
>
>         /* Compare indexes */
>         greatest_modulus = DatumGetInt32(b1->datums[b1->ndatums - 1][0]);
>         for (i = 0; i < greatest_modulus; i++)
>             if (b1->indexes[i] != b2->indexes[i])
>                 return false;
>     }
> if we return true from where this block ends, we will save one indenation level
> for rest of the code and also FWIW extra diffs in this patch because of this
> indentation change.
>

I still believe that keeping this code in the IF-ELSE block is better in
the long term than cluttering the code just to avoid a diff that is
unpleasant for now.

> +        /*
> +         * Hash operator classes provide only equality, not ordering.
> +         * Collation, which is relevant for ordering and not equality is
> +         * irrelevant for hash partitioning.
> +         */
> A comma is missing after "equality", and may be we need "for" before
> "equality".
>          * Collation, which is relevant for ordering and not equality, is
>
> +         * we use hash operator class. */
> */ should be on new line.
>

Fixed.

Regards,
Amul

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

17 May 2017, 16:24:46

On Wed, May 17, 2017 at 2:07 PM, amul sul <sulamul@gmail.com> wrote:

>
>> In partition_bounds_equal(), please add comments explaining why is it safe to
>> check just the indexes? May be we should add code under assertion to make sure
>> that the datums are equal as well.
>
> Added assert in the attached version.
>
>> The comment could be something
>> like, "If two partitioned tables have different greatest moduli, their
>> partition schemes don't match. If they have same greatest moduli, and
>> all remainders have different indexes, they all have same modulus
>> specified and the partitions are ordered by remainders, thus indexes
>> array will be an identity i.e. index[i] = i. If the partition
>> corresponding to a given remainder exists, it will have same index
>> entry for both partitioned tables or if it's missing it will be -1.
>> Thus if indexes array matches, corresponding datums array matches. If
>> there are multiple remainders corresponding to a given partition,
>> their partitions are ordered by the lowest of the remainders, thus if
>> indexes array matches, both of the tables have same indexes arrays, in
>> both the tables remainders corresponding to multiple partitions all
>> have same indexes and thus same modulus. Thus again if the indexes are
>> same, datums are same.".
>>
>
> Thanks, added with minor modification.

I have reworded this slightly. See the attached patch, which is a diff on
top of 0002.

>
>> In the same function
>>     if (key->strategy == PARTITION_STRATEGY_HASH)
>>     {
>>         int            greatest_modulus;
>>
>>         /*
>>          * Compare greatest modulus of hash partition bound which
>>          * is the last element of datums array.
>>          */
>>         if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
>>             return false;
>>
>>         /* Compare indexes */
>>         greatest_modulus = DatumGetInt32(b1->datums[b1->ndatums - 1][0]);
>>         for (i = 0; i < greatest_modulus; i++)
>>             if (b1->indexes[i] != b2->indexes[i])
>>                 return false;
>>     }
>> if we return true from where this block ends, we will save one indenation level
>> for rest of the code and also FWIW extra diffs in this patch because of this
>> indentation change.
>>
>
> I still do believe having this code in the IF - ELSE block will be
> better for longterm, rather having code clutter to avoid diff that
> unpleasant for now.

Ok, I will leave it to the committer to judge.


Comments on the tests
+#ifdef USE_ASSERT_CHECKING
+        {
+            /*
+             * Hash partition bound stores modulus and remainder at
+             * b1->datums[i][0] and b1->datums[i][1] position respectively.
+             */
+            for (i = 0; i < b1->ndatums; i++)
+                Assert((b1->datums[i][0] == b2->datums[i][0] &&
+                        b1->datums[i][1] == b2->datums[i][1]));
+        }
+#endif
Why do we need extra {} here?

Comments on testcases
+CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH (modulus 8, remainder 0);
+CREATE TABLE fail_part (LIKE hpart_1 INCLUDING CONSTRAINTS);
+ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH (modulus 4, remainder 0);
You should probably also test the reverse case, i.e. create a partition
with modulus 4, remainder 0 and then try to add partitions with modulus 8,
remainder 4 and with modulus 8, remainder 0. Those attempts should fail.
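
To spell out why both attempts should fail, here is a small sketch of an
overlap test between two hash bounds. It assumes the smaller modulus
divides the larger one (the rule the patch enforces elsewhere); the
function name is hypothetical.

```c
/*
 * Return 1 if bounds (m1, r1) and (m2, r2) accept at least one common hash
 * value.  Assumes the smaller modulus is a factor of the larger one, in
 * which case the two overlap exactly when the remainder of the larger
 * bound, reduced by the smaller modulus, equals the smaller bound's
 * remainder.
 */
static int
hash_bounds_overlap(int m1, int r1, int m2, int r2)
{
    int         small_m = (m1 <= m2) ? m1 : m2;
    int         small_r = (m1 <= m2) ? r1 : r2;
    int         large_r = (m1 <= m2) ? r2 : r1;

    return (large_r % small_m) == small_r;
}
```

Both (8, 4) and (8, 0) reduce to remainder 0 under modulus 4, so each of
them collides with an existing (4, 0) partition.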

Why create two tables, hash_parted and hash_parted2? You should be able to
test with a single table.

+INSERT INTO hpart_2 VALUES (3, 'a');
+DELETE FROM hpart_2;
+INSERT INTO hpart_5_a (a, b) VALUES (6, 'a');
This is slightly tricky. On different platforms the row may map to different
partitions depending upon how the values are hashed. So, this test may not be
portable on all the platforms. Probably you should add such testcases with a
custom hash operator class which is identity function as suggested by Robert.
This also applies to the tests in insert.sql and update.sql for partitioned
table without custom opclass.

+-- delete the faulting row and also add a constraint to skip the scan
+ALTER TABLE hpart_5 ADD CONSTRAINT hcheck_a CHECK (a IN (5)), ALTER a SET NOT NULL;
The constraint is not the same as the implicit constraint added for that
partition. I am not sure whether it is really going to avoid the scan. Did
you verify it? If yes, then how?

+ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH (modulus 3, remainder 2);
+ERROR:  every hash partition modulus must be a factor of the next larger modulus
We should add this test with at least two partitions in there so that we
can check both the lower and the upper modulus. Also, testing with some of
the interesting bounds discussed earlier in this mail, e.g. adding modulus
15 when 5, 10, 60 exist, will be better than testing with 3, 4 and 8.
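
The rule in the error message can be sketched as below. This is only an
illustration of the "factor of the next larger modulus" check against a
sorted list of existing moduli; the function name and the array-based
interface are assumptions, not the patch's code.

```c
/*
 * moduli[] holds the distinct existing moduli in ascending order.
 * Return 1 if new_modulus keeps the chain valid, i.e. the next smaller (or
 * equal) modulus divides it and it divides the next larger modulus.
 */
static int
new_modulus_is_valid(const int *moduli, int n, int new_modulus)
{
    int         prev = 0;       /* largest existing modulus <= new_modulus */
    int         next = 0;       /* smallest existing modulus > new_modulus */

    for (int i = 0; i < n; i++)
    {
        if (moduli[i] <= new_modulus)
            prev = moduli[i];
        else
        {
            next = moduli[i];
            break;
        }
    }

    if (prev != 0 && new_modulus % prev != 0)
        return 0;
    if (next != 0 && next % new_modulus != 0)
        return 0;
    return 1;
}
```

With 5, 10 and 60 present, modulus 15 is rejected because 10 is not a
factor of 15, even though 15 is a factor of 60; that is what makes it a
more interesting case than 3 against 4 and 8.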

+ERROR:  cannot use collation for hash partition key column "a"
This seems to indicate that we cannot specify a collation for the hash
partition key column, which isn't true. Column a here can have its own
collation. What's not allowed is specifying a collation in the PARTITION
BY clause. Maybe reword the error as "cannot use collation for hash
partitioning", or plainly "cannot use collation in PARTITION BY clause for
hash partitioning".

+ERROR:  invalid bound specification for a list partition
+LINE 1: CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES W...
+                                                        ^
Should the location for this error be that of the WITH clause, as in the
case of range and list partitioned tables?

+select tableoid::regclass as part, a from hash_parted order by part;
Maybe add a % 4 to show clearly that the data really goes to the partition
with that remainder.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment

0002-extras.patch

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

17 May 2017, 21:21:48

On Wed, May 17, 2017 at 1:41 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> Fixed in the attached version;  used "hash partition remainder must be
>> greater than or equal to 0" instead.
>
> I would suggest "non-zero positive", since that's what we are using in
> the documentation.

Well, that's not very good terminology, because zero is not a positive
number.  Existing error messages seem to use phrasing such as "THING
must be a positive integer" when zero is not allowed or "THING must be
a non-negative integer" when zero is allowed.  For examples, do git
grep errmsg.*positive or git grep errmsg.*negative.

> In partition_bounds_equal(), please add comments explaining why is it safe to
> check just the indexes? May be we should add code under assertion to make sure
> that the datums are equal as well. The comment could be something
> like, "If two partitioned tables have different greatest moduli, their
> partition schemes don't match. If they have same greatest moduli, and
> all remainders have different indexes, they all have same modulus
> specified and the partitions are ordered by remainders, thus indexes
> array will be an identity i.e. index[i] = i. If the partition
> corresponding to a given remainder exists, it will have same index
> entry for both partitioned tables or if it's missing it will be -1.
> Thus if indexes array matches, corresponding datums array matches. If
> there are multiple remainders corresponding to a given partition,
> their partitions are ordered by the lowest of the remainders, thus if
> indexes array matches, both of the tables have same indexes arrays, in
> both the tables remainders corresponding to multiple partitions all
> have same indexes and thus same modulus. Thus again if the indexes are
> same, datums are same.".

That seems quite long.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

18 May 2017, 06:58:22

On Wed, May 17, 2017 at 11:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 17, 2017 at 1:41 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>>> Fixed in the attached version;  used "hash partition remainder must be
>>> greater than or equal to 0" instead.
>>
>> I would suggest "non-zero positive", since that's what we are using in
>> the documentation.
>
> Well, that's not very good terminology, because zero is not a positive
> number.  Existing error messages seem to use phrasing such as "THING
> must be a positive integer" when zero is not allowed or "THING must be
> a non-negative integer" when zero is allowed.  For examples, do git
> grep errmsg.*positive or git grep errmsg.*negative.

Ok. We need to change all the usages in the documentation and in the
comments to non-negative. The point is to use same phrases
consistently.

>
>> In partition_bounds_equal(), please add comments explaining why is it safe to
>> check just the indexes? May be we should add code under assertion to make sure
>> that the datums are equal as well. The comment could be something
>> like, "If two partitioned tables have different greatest moduli, their
>> partition schemes don't match. If they have same greatest moduli, and
>> all remainders have different indexes, they all have same modulus
>> specified and the partitions are ordered by remainders, thus indexes
>> array will be an identity i.e. index[i] = i. If the partition
>> corresponding to a given remainder exists, it will have same index
>> entry for both partitioned tables or if it's missing it will be -1.
>> Thus if indexes array matches, corresponding datums array matches. If
>> there are multiple remainders corresponding to a given partition,
>> their partitions are ordered by the lowest of the remainders, thus if
>> indexes array matches, both of the tables have same indexes arrays, in
>> both the tables remainders corresponding to multiple partitions all
>> have same indexes and thus same modulus. Thus again if the indexes are
>> same, datums are same.".
>
> That seems quite long.

I have shared a patch containing a denser explanation with my last set
of comments.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: [HACKERS] [POC] hash partitioning

From

Dilip Kumar

Date:

18 May 2017, 19:09:03

On Wed, May 17, 2017 at 2:07 PM, amul sul <sulamul@gmail.com> wrote:
>> I would suggest "non-zero positive", since that's what we are using in
>> the documentation.
>>
>
> Understood, Fixed in the attached version.

Why non-zero positive?  We do support zero for the remainder right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

19 May 2017, 08:01:39

On 2017/05/19 1:09, Dilip Kumar wrote:
> On Wed, May 17, 2017 at 2:07 PM, amul sul <sulamul@gmail.com> wrote:
>>> I would suggest "non-zero positive", since that's what we are using in
>>> the documentation.
>>>
>>
>> Understood, Fixed in the attached version.
> 
> Why non-zero positive?  We do support zero for the remainder right?

Using "non-negative integers" (for remainders) was suggested upthread.

Thanks,
Amit

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

19 May 2017, 12:32:51

On Wed, May 17, 2017 at 6:54 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
[...]

>
> Comments on the tests
> +#ifdef USE_ASSERT_CHECKING
> +        {
> +            /*
> +             * Hash partition bound stores modulus and remainder at
> +             * b1->datums[i][0] and b1->datums[i][1] position respectively.
> +             */
> +            for (i = 0; i < b1->ndatums; i++)
> +                Assert((b1->datums[i][0] == b2->datums[i][0] &&
> +                        b1->datums[i][1] == b2->datums[i][1]));
> +        }
> +#endif
> Why do we need extra {} here?
>

Okay, removed in the attached version.

> Comments on testcases
> +CREATE TABLE hpart_1 PARTITION OF hash_parted FOR VALUES WITH
> (modulus 8, remainder 0);
> +CREATE TABLE fail_part (LIKE hpart_1 INCLUDING CONSTRAINTS);
> +ALTER TABLE hash_parted ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 4, remainder 0);
> Probably you should also test the other-way round case i.e. create modulus 4,
> remainder 0 partition and then try to add partitions with modulus 8, remainder
> 4 and modulus 8, remainder 0. That should fail.
>

Fixed.

> Why to create two tables hash_parted and hash_parted2, you should be able to
> test with only a single table.
>

Fixed.

> +INSERT INTO hpart_2 VALUES (3, 'a');
> +DELETE FROM hpart_2;
> +INSERT INTO hpart_5_a (a, b) VALUES (6, 'a');
> This is slightly tricky. On different platforms the row may map to different
> partitions depending upon how the values are hashed. So, this test may not be
> portable on all the platforms. Probably you should add such testcases with a
> custom hash operator class which is identity function as suggested by Robert.
> This also applies to the tests in insert.sql and update.sql for partitioned
> table without custom opclass.
>

Yes, you are correct. Fixed in the attached version.

> +-- delete the faulting row and also add a constraint to skip the scan
> +ALTER TABLE hpart_5 ADD CONSTRAINT hcheck_a CHECK (a IN (5)), ALTER a
> SET NOT NULL;
> The constraint is not same as the implicit constraint added for that partition.
> I am not sure whether it's really going to avoid the scan. Did you verify it?
> If yes, then how?
>

I haven't tested that; maybe I copied it blindly, sorry about that.
I don't think this test is needed again for hash partitioning, so I have
removed it.

> +ALTER TABLE hash_parted2 ATTACH PARTITION fail_part FOR VALUES WITH
> (modulus 3, remainder 2);
> +ERROR:  every hash partition modulus must be a factor of the next
> larger modulus
> We should add this test with at least two partitions in there so that we can
> check lower and upper modulus. Also, testing with some interesting
> bounds discussed earlier
> in this mail e.g. adding modulus 15 when 5, 10, 60 exist will be better than
> testing with 3, 4 and 8.
>
A similar test already exists in the create_table.sql file.

> +ERROR:  cannot use collation for hash partition key column "a"
> This seems to indicate that we can not specify collation for hash partition key
> column, which isn't true. Column a here can have its collation. What's not
> allowed is specifying collation in PARTITION BY clause.
> May be reword the error as "cannot use collation for hash partitioning". or
> plain "cannot use collation in PARTITION BY clause for hash partitioning".
>
> +ERROR:  invalid bound specification for a list partition
> +LINE 1: CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES W...
> +                                                        ^
> Should the location for this error be that of WITH clause like in case of range
> and list partitioned table.
>

Fixed.

> +select tableoid::regclass as part, a from hash_parted order by part;
> May be add a % 4 to show clearly that the data really goes to the partitioning
> with that remainder.
>

Fixed.

Updated patch attached.  0001-patch rebased against latest head.
0002-patch also incorporates code comments and error message changes
as per Robert's & your suggestions. Thanks !

Regards,
Amul

On Fri, May 19, 2017 at 10:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 19, 2017 at 5:32 AM, amul sul <sulamul@gmail.com> wrote:
>> Updated patch attached.  0001-patch rebased against latest head.
>> 0002-patch also incorporates code comments and error message changes
>> as per Robert's & your suggestions. Thanks !
>
> -                if (spec->strategy != PARTITION_STRATEGY_LIST)
> -                    elog(ERROR, "invalid strategy in partition bound spec");
> +                Assert(spec->strategy == PARTITION_STRATEGY_LIST);
>
> Let's just drop these hunks.  I realize this is a response to a review
> comment I made, but I take it back.  If the existing code is already
> doing it this way, there's no real need to revise it.  The patch
> doesn't even make it consistent anyway, since elsewhere you elog() for
> a similar case.  Perhaps elog() is best anyway.
>
Done.

> -    partitioning methods include range and list, where each partition is
> -    assigned a range of keys and a list of keys, respectively.
> +    partitioning methods include hash, range and list, where each partition is
> +    assigned a modulus and remainder of keys, a range of keys and a list of
> +    keys, respectively.
>
> I think this sentence has become too long and unwieldy, and is more
> unclear than helpful.  I'd just write "The currently supported
> partitioning methods are list, range, and hash."  The use of the word
> include is actually wrong here, because it implies that there are more
> not mentioned here, which is false.
>
Done.

> -      expression.  If no btree operator class is specified when creating a
> -      partitioned table, the default btree operator class for the datatype will
> -      be used.  If there is none, an error will be reported.
> +      expression.  List and range partitioning uses only btree operator class.
> +      Hash partitioning uses only hash operator class. If no operator class is
> +      specified when creating a partitioned table, the default operator class
> +      for the datatype will be used.  If there is none, an error will be
> +      reported.
> +     </para>
>
> I suggest: If no operator class is specified when creating a
> partitioned table, the default operator class of the appropriate type
> (btree for list and range partitioning, hash for hash partitioning)
> will be used.  If there is none, an error will be reported.
>
Done.

> +     <para>
> +      Since hash partitiong operator class, provide only equality,
> not ordering,
> +      collation is not relevant in hash partition key column. An error will be
> +      reported if collation is specified.
>
> partitiong -> partitioning.  Also, remove the comma after "operator
> class" and change "not relevant in hash partition key column" to "not
> relevant for hash partitioning".  Also change "if collation is
> specified" to "if a collation is specified".
>
Done.

> +   Create a hash partitioned table:
> +<programlisting>
> +CREATE TABLE orders (
> +    order_id     bigint not null,
> +    cust_id      bigint not null,
> +    status       text
> +) PARTITION BY HASH (order_id);
> +</programlisting></para>
>
> Move this down so it's just above the example of creating partitions.
>
Done.

> + * For range and list partitioned tables, datums is an array of datum-tuples
> + * with key->partnatts datums each.
> + * For hash partitioned tables, it is an array of datum-tuples with 2 datums,
> + * modulus and remainder, corresponding to a given partition.
>
> Second line is very short; reflow as one paragraph.
>
Done

>   * In case of range partitioning, it stores one entry per distinct range
>   * datum, which is the index of the partition for which a given datum
>   * is an upper bound.
> + * In the case of hash partitioning, the number of the entries in the indexes
> + * array is same as the greatest modulus amongst all partitions. For a given
> + * partition key datum-tuple, the index of the partition which would
> accept that
> + * datum-tuple would be given by the entry pointed by remainder produced when
> + * hash value of the datum-tuple is divided by the greatest modulus.
>
> Insert line break before the new text as a paragraph break.

Will wait for more inputs on Ashutosh's explanation upthread.
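
For what it's worth, the lookup described in the quoted comment amounts to
something like the sketch below: reduce the hash value by the greatest
modulus and use the result as a subscript into the indexes array. The
function name is hypothetical, and it assumes greatest_modulus is positive
and that indexes[] has greatest_modulus entries, with -1 for a remainder
that no partition accepts.

```c
#include <stdint.h>

/*
 * Return the partition index for a hash value, or -1 if no partition
 * accepts the corresponding remainder.
 */
static int
route_to_partition(const int *indexes, int greatest_modulus, uint32_t hash)
{
    return indexes[hash % (uint32_t) greatest_modulus];
}
```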

>
> +    char        strategy;        /* hash, list or range bounds? */
>
> Might be clearer to just write /* hash, list, or range? */ or /*
> bounds for hash, list, or range? */
>

Done, used "hash, list, or range?"

>
> +static uint32 compute_hash_value(PartitionKey key, Datum *values,
> bool *isnull);
> +
>
> I think there should be a blank line after this but not before it.
>

Done.

> I don't really see why hash partitioning needs to touch
> partition_bounds_equal() at all.  Why can't the existing logic work
> for hash partitioning without change?
>

Unlike list and range partitioning, ndatums does not represent the size of
the indexes array; also, the dimension of the datums array is not the same
as key->partnatts.

> +                                valid_bound = true;
>
> valid_modulus, maybe?
>

Sure, added.

> -                   errmsg("data type %s has no default btree operator class",
> -                          format_type_be(atttype)),
> -                         errhint("You must specify a btree operator
> class or define a default btree operator class for the data type.")));
> +                      errmsg("data type %s has no default %s operator class",
> +                             format_type_be(atttype), am_method),
> +                         errhint("You must specify a %s operator
> class or define a default %s operator class for the data type.",
> +                                 am_method, am_method)));
>
> Let's use this existing wording from typecmds.c:
>
>                      errmsg("data type %s has no default operator
> class for access method \"%s\"",
>
> and for the hint, maybe: You must specify an operator class or define
> a default operator class for the data type.  Leave out the %s, in
> other words.
>

Done.

> +        /*
> +         * Hash operator classes provide only equality, not ordering.
> +         * Collation, which is relevant for ordering and not for equality, is
> +         * irrelevant for hash partitioning.
> +         */
> +        if (*strategy == PARTITION_STRATEGY_HASH && pelem->collation != NIL)
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                     errmsg("cannot use collation for hash partitioning"),
> +                     parser_errposition(pstate, pelem->location)));
>
> This error message is not very informative, and it requires
> propagating information about the partitioning type into parts of the
> code that otherwise don't require it.  I was waffling before on
> whether to ERROR here; I think now I'm in favor of ignoring the
> problem.  The collation won't do any harm; it just won't affect the
> behavior.
>

Removed.

> +         * Identify opclass to use.  For list and range partitioning we use
> +         * only btree operator class, which seems enough for those.  For hash
> +         * partitioning, we use hash operator class.
>
> Strange wording.  Suggest: Identify the appropriate operator class.
> For list and range partitioning, we use a btree operator class; hash
> partitioning uses a hash operator class.
>

Done

> +            FOR VALUES WITH '(' hash_partbound ')' /*TODO: syntax is not finalised*/
>
> Remove the comment.
>

Done.

> +                    foreach (lc, $5)
> +                    {
> +                        DefElem    *opt = (DefElem *) lfirst(lc);
> +
> +                        if (strcmp(opt->defname, "modulus") == 0)
> +                            n->modulus = defGetInt32(opt);
> +                        else if (strcmp(opt->defname, "remainder") == 0)
> +                            n->remainder = defGetInt32(opt);
> +                        else
> +                            ereport(ERROR,
> +                                    (errcode(ERRCODE_SYNTAX_ERROR),
> +                                     errmsg("unrecognized hash partition bound specification \"%s\"",
> +                                            opt->defname),
> +                                     parser_errposition(opt->location)));
> +                    }
>
> This logic doesn't complain if the same option is specified more than
> once.  I suggest adding a check for that, and also pushing this logic
> out into a helper function that gets called here instead of including
> it inline.
>

Added an error for duplicate options.
As for the separate helper function, can't we keep it as it is? IMO, we
are not going to use it anywhere else.
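
For illustration only, the duplicate check requested in the review can be sketched in isolation. This is not the patch's code: `HashBoundOpt`, `HashBoundStatus`, and `parse_hash_bound` are hypothetical names standing in for the `DefElem` handling in gram.y, and status codes take the place of `ereport`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one parsed "name value" option (not DefElem). */
typedef struct HashBoundOpt
{
    const char *name;
    int         value;
} HashBoundOpt;

/* Hypothetical status codes; the real code would ereport(ERROR, ...). */
typedef enum HashBoundStatus
{
    HB_OK,
    HB_DUPLICATE,       /* the same option was given more than once */
    HB_UNRECOGNIZED     /* neither "modulus" nor "remainder" */
} HashBoundStatus;

/*
 * Copy "modulus" and "remainder" out of opts[], rejecting repeats.
 * Outputs are left untouched for options that are absent.
 */
HashBoundStatus
parse_hash_bound(const HashBoundOpt *opts, size_t nopts,
                 int *modulus, int *remainder)
{
    bool        have_modulus = false;
    bool        have_remainder = false;

    for (size_t i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].name, "modulus") == 0)
        {
            if (have_modulus)
                return HB_DUPLICATE;
            *modulus = opts[i].value;
            have_modulus = true;
        }
        else if (strcmp(opts[i].name, "remainder") == 0)
        {
            if (have_remainder)
                return HB_DUPLICATE;
            *remainder = opts[i].value;
            have_remainder = true;
        }
        else
            return HB_UNRECOGNIZED;
    }
    return HB_OK;
}
```

Tracking one flag per option keeps the check linear in the number of options and works the same whether the logic stays inline or moves into a helper.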


> +                   errmsg("hash partition modulus must be a positive integer")));
>
> modulus for hash partition
>
> +                     errmsg("hash partition remainder must be a non-negative integer")));
>
> remainder for hash partition
>
> +            errmsg("hash partition modulus must be greater than remainder")));
>
> modulus for hash partition must be greater than remainder
>

Done.  Similar changes in gram.y as well.
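
As a rough sketch of the three checks discussed above, using the reworded messages: `check_hash_bound` is a hypothetical helper, and the complete message strings are a reading of the reviewer's suggestions, not text taken from the patch.

```c
#include <stddef.h>

/*
 * Return NULL if (modulus, remainder) is a valid hash partition bound,
 * otherwise a message describing the first problem found.
 * Hypothetical helper; the patch reports these via ereport().
 */
const char *
check_hash_bound(int modulus, int remainder)
{
    if (modulus <= 0)
        return "modulus for hash partition must be a positive integer";
    if (remainder < 0)
        return "remainder for hash partition must be a non-negative integer";
    if (remainder >= modulus)
        return "modulus for hash partition must be greater than remainder";
    return NULL;
}
```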

> +-- values are hashed, row may map to different partitions, which result in
>
> the row
>
> +-- regression failure.  To avoid this, let's create non-default hash function
>
> create a non-default

Done.

Updated patch attached. Thanks a lot for review.

Regards,
Amul

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

25 May 2017, 07:29:00

On Mon, May 22, 2017 at 2:23 PM, amul sul <sulamul@gmail.com> wrote:
>
> Updated patch attached. Thanks a lot for review.
>
Minor fix in the document, PFA.

Regards,
Amul


Hi Dilip,

Thanks for review.

On Sat, Jun 3, 2017 at 6:54 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Thu, May 25, 2017 at 9:59 AM, amul sul <sulamul@gmail.com> wrote:
>> On Mon, May 22, 2017 at 2:23 PM, amul sul <sulamul@gmail.com> wrote:
>>>
>>> Updated patch attached. Thanks a lot for review.
>>>
>> Minor fix in the document, PFA.
>
> The patch needs a rebase.
>

Done.

> -------
> The function header is not consistent with the neighbouring functions
> (some functions contain the function name in the header but others don't)
> +/*
> + * Compute the hash value for given not null partition key values.
> + */
>
Done.

> ------
> postgres=# create table t1 partition of t for values with (modulus 2,
> remainder 1) partition by range(a);
> CREATE TABLE
> postgres=# create table t1_1 partition of t1 for values from (8) to (10);
> CREATE TABLE
> postgres=# insert into t1 values(8);
> 2017-06-03 18:41:46.067 IST [5433] ERROR:  new row for relation "t1_1"
> violates partition constraint
> 2017-06-03 18:41:46.067 IST [5433] DETAIL:  Failing row contains (8).
> 2017-06-03 18:41:46.067 IST [5433] STATEMENT:  insert into t1 values(8);
> ERROR:  new row for relation "t1_1" violates partition constraint
> DETAIL:  Failing row contains (8).
>
> The value 8 violates the partition constraint of t1, and we are
> inserting it into t1, yet the error comes from the leaf-level table
> t1_1. That may be fine, but the error makes it appear that the row
> violates the constraint of t1_1, whereas it actually violates the
> constraint of t1.
>
> From the implementation, it appears that we identify the leaf
> partition based on the key, and the insert only fails in ExecInsert
> while checking the partition constraint.
>
May I ask how you are sure that 8 is an unfit value for the t1 relation?
And what if the value is something other than 8, e.g. 7?

Updated patch attached.

Regards,
Amul Sul


On Fri, Jun 23, 2017 at 10:11 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> On Tue, 6 Jun 2017 13:03:58 +0530
> amul sul <sulamul@gmail.com> wrote:
>
>
>> Updated patch attached.
>
> I looked into the latest patch (v13) and have some comments,
> although they might be trivial.
>
Thanks for your review.

> First, I couldn't apply this patch to the latest HEAD due to
> a documentation fix and pgindent updates. It needs a rebase.
>
> $ git apply /tmp/0002-hash-partitioning_another_design-v13.patch
> error: patch failed: doc/src/sgml/ref/create_table.sgml:87
> error: doc/src/sgml/ref/create_table.sgml: patch does not apply
> error: patch failed: src/backend/catalog/partition.c:76
> error: src/backend/catalog/partition.c: patch does not apply
> error: patch failed: src/backend/commands/tablecmds.c:13371
> error: src/backend/commands/tablecmds.c: patch does not apply
>
Fixed.

>
>        <varlistentry>
> +       <term>Hash Partitioning</term>
> +
> +       <listitem>
> +        <para>
> +         The table is partitioned by specifying modulus and remainder for each
> +         partition. Each partition holds rows for which the hash value of
> +         partition keys when divided by specified modulus produces specified
> +         remainder. For more clarification on modulus and remainder please refer
> +         <xref linkend="sql-createtable-partition">.
> +        </para>
> +       </listitem>
> +      </varlistentry>
> +
> +      <varlistentry>
>         <term>Range Partitioning</term>
>
> I think this section should be inserted after List Partitioning section because
> the order of the descriptions is Range, List, then Hash in other places of
> the documentation. At least,
>
Fixed in the attached version.
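
The rule in the documentation hunk quoted above (a partition holds the rows whose hash value, divided by the modulus, leaves the specified remainder) can be written as a one-line predicate. This is a minimal sketch; `hash_bound_accepts` is a hypothetical name, and `hashval` is the hash of the partition key, not the key itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a row whose partition key hashes to hashval belongs to the
 * partition declared WITH (modulus, remainder).  Assumes modulus > 0.
 */
bool
hash_bound_accepts(uint64_t hashval, uint32_t modulus, uint32_t remainder)
{
    return hashval % modulus == remainder;
}
```

Because the test is applied to the hash value, whether a given key lands in a (modulus 2, remainder 1) partition cannot be read off the key alone.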

>
> -    <firstterm>partition bounds</firstterm>.  Currently supported
> -    partitioning methods include range and list, where each partition is
> -    assigned a range of keys and a list of keys, respectively.
> +    <firstterm>partition bounds</firstterm>.  The currently supported
> +    partitioning methods are list, range, and hash.
>     </para>
>
> Also in this hunk. I think "The currently supported partitioning methods are
> range, list, and hash." is better. We don't need to change the order of
> the original description.
>
Fixed in the attached version.

>
>        <listitem>
>         <para>
> -        Declarative partitioning only supports list and range partitioning,
> -        whereas table inheritance allows data to be divided in a manner of
> -        the user's choosing.  (Note, however, that if constraint exclusion is
> -        unable to prune partitions effectively, query performance will be very
> -        poor.)
> +        Declarative partitioning only supports hash, list and range
> +        partitioning, whereas table inheritance allows data to be divided in a
> +        manner of the user's choosing.  (Note, however, that if constraint
> +        exclusion is unable to prune partitions effectively, query performance
> +        will be very poor.)
>
> Similarly, I think "Declarative partitioning only supports range, list and hash
> partitioning," is better.
>
Fixed in the attached version.

>
> +
> +  <para>
> +   Create a hash partitioned table:
> +<programlisting>
> +CREATE TABLE orders (
> +    order_id     bigint not null,
> +    cust_id      bigint not null,
> +    status       text
> +) PARTITION BY HASH (order_id);
> +</programlisting></para>
> +
>
> This paragraph should be inserted between the "Create a list partitioned table:"
> paragraph and the "Create partition of a range partitioned table:" paragraph,
> so that it follows the range and list examples.
>
Fixed in the attached version.

>
>                 *strategy = PARTITION_STRATEGY_LIST;
>         else if (pg_strcasecmp(partspec->strategy, "range") == 0)
>                 *strategy = PARTITION_STRATEGY_RANGE;
> +       else if (pg_strcasecmp(partspec->strategy, "hash") == 0)
> +               *strategy = PARTITION_STRATEGY_HASH;
>         else
>                 ereport(ERROR,
>
> In most of the code, the order is hash, range, then list, but only
> in transformPartitionSpec() is the order list, range, then hash,
> as above. Maybe it is better to be uniform.
>
Makes sense, fixed in the attached version.

>
> +                       {
> +                               if (strategy == PARTITION_STRATEGY_HASH)
> +                                       ereport(ERROR,
> +                                                       (errcode(ERRCODE_UNDEFINED_OBJECT),
> +                                                        errmsg("data type %s has no default hash operator class",
> +                                                                       format_type_be(atttype)),
> +                                                        errhint("You must specify a hash operator class or define a default hash operator class for the data type.")));
> +                               else
> +                                       ereport(ERROR,
> +                                                       (errcode(ERRCODE_UNDEFINED_OBJECT),
> +                                                        errmsg("data type %s has no default btree operator class",
> +                                                                       format_type_be(atttype)),
> +                                                        errhint("You must specify a btree operator class or define a default btree operator class for the data type.")));
> +
> +
>
>                                                                                            atttype,
> -                                                                                          "btree",
> -                                                                                          BTREE_AM_OID);
> +                                                                                          am_oid == HASH_AM_OID ? "hash" : "btree",
> +                                                                                          am_oid);
>
> How about writing this part as following to reduce code redundancy?
>
> +       Oid                     am_oid;
> +       char       *am_name;
>
> <snip>
>
> +               if (strategy == PARTITION_STRATEGY_HASH)
> +               {
> +                       am_oid = HASH_AM_OID;
> +                       am_name = pstrdup("hash");
> +               }
> +               else
> +               {
> +                       am_oid = BTREE_AM_OID;
> +                       am_name = pstrdup("btree");
> +               }
> +
>                 if (!pelem->opclass)
>                 {
> -                       partopclass[attn] = GetDefaultOpClass(atttype, BTREE_AM_OID);
> +                       partopclass[attn] = GetDefaultOpClass(atttype, am_oid);
>
>                         if (!OidIsValid(partopclass[attn]))
>                                 ereport(ERROR,
>                                                 (errcode(ERRCODE_UNDEFINED_OBJECT),
> -                                  errmsg("data type %s has no default btree operator class",
> -                                                 format_type_be(atttype)),
> -                                                errhint("You must specify a btree operator class or define a default btree operator class for the data type.")));
> +                                                errmsg("data type %s has no default %s operator class",
> +                                                               format_type_be(atttype), am_name),
> +                                                errhint("You must specify a %s operator class or define a default %s operator class for the data type.",
> +                                                                am_name, am_name)));
> +
>                 }
>                 else
>                         partopclass[attn] = ResolveOpClass(pelem->opclass,
>                                                                                            atttype,
> -                                                                                          "btree",
> -                                                                                          BTREE_AM_OID);
> +                                                                                          am_name,
> +                                                                                          am_oid);
>
I had the same thought before (see the v12 patch and earlier), but
changed it due to review comments upthread.

>
> There is a meaningless indentation change.
>
> @@ -2021,7 +2370,8 @@ get_partition_for_tuple(PartitionDispatch *pd,
>                     /* bsearch in partdesc->boundinfo */
>                     cur_offset = partition_bound_bsearch(key,
>                                                          partdesc->boundinfo,
> -                                                        values, false, &equal);
> +                                                     values, false, &equal);
> +
>                     /*
>                      * Offset returned is such that the bound at offset is
>
Fixed in the attached version.

>
> The comment of pg_get_partkeydef() has not been updated.
>
>  * pg_get_partkeydef
>  *
>  * Returns the partition key specification, ie, the following:
>  *
>  * PARTITION BY { RANGE | LIST } (column opt_collation opt_opclass [, ...])
>  */
> Datum
> pg_get_partkeydef(PG_FUNCTION_ARGS)
> {
>
Thanks for catching this, fixed in the attached version.

Regards,
Amul


On Wed, Jul 5, 2017 at 4:50 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Jul 3, 2017 at 4:39 PM, amul sul <sulamul@gmail.com> wrote:
>> Thanks for catching this, fixed in the attached version.
>
> Few comments on the latest version.
>

Thanks for your review. Please find my comments inline:

> 0001 looks fine, for 0002 I have some comments.
>
> 1.
> + hbounds = (PartitionHashBound * *) palloc(nparts *
> +  sizeof(PartitionHashBound *));
>
> /s/(PartitionHashBound * *)/(PartitionHashBound **)/g
>

Fixed in the attached version.

> 2.
> RelationBuildPartitionDesc
> {
>      ....
>
>
> * catalog scan that retrieved them, whereas that in the latter is
> * defined by canonicalized representation of the list values or the
> * range bounds.
> */
> for (i = 0; i < nparts; i++)
> result->oids[mapping[i]] = oids[i];
>
> Should this comment mention hash as well?
>

Instead, I have generalised this comment in the attached patch

> 3.
>
> if (b1->datums[b1->ndatums - 1][0] != b2->datums[b2->ndatums - 1][0])
> return false;
>
> if (b1->ndatums != b2->ndatums)
> return false;
>
> If ndatums itself is different then there is no need to access the datum
> memory, so it is better to check ndatums first.
>

You are correct. We are already doing this in
partition_bounds_equal(), so this code is redundant; removed in the
attached version.
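
For reference, the ordering suggested in the review (compare the counts before touching the arrays) looks like this in isolation. `BoundInfo` and `last_bounds_match` are hypothetical stand-ins, with `long` used in place of `Datum`; this is not the structure used in partition.c.

```c
#include <stdbool.h>

/* Hypothetical, simplified bound descriptor. */
typedef struct BoundInfo
{
    int     ndatums;
    long  **datums;     /* datums[i][0] is the first datum of bound i */
} BoundInfo;

/* Compare the first datum of the last bound, checking the counts first. */
bool
last_bounds_match(const BoundInfo *b1, const BoundInfo *b2)
{
    /* Cheap check first: different counts can never match. */
    if (b1->ndatums != b2->ndatums)
        return false;

    /* Nothing to dereference when both are empty. */
    if (b1->ndatums == 0)
        return true;

    return b1->datums[b1->ndatums - 1][0] == b2->datums[b2->ndatums - 1][0];
}
```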

> 4.
> + * next larger modulus.  For example, if you have a bunch
> + * of partitions that all have modulus 5, you can add a
> + * new new partition with modulus 10 or a new partition
>
> Typo, "new new partition"  -> "new partition"
>

Fixed in the attached version.
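
The quoted comment is cut off, so the following is a sketch under a stated assumption: that the rule it describes is that every modulus must be a factor of the next larger modulus. `modulus_is_compatible` is a hypothetical helper, not a function from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Would adding a partition with new_modulus keep every modulus a factor
 * of the next larger one?  moduli[] holds the existing partitions' moduli
 * (positive, any order, repeats allowed) and is assumed to already
 * satisfy the rule.
 */
bool
modulus_is_compatible(const int *moduli, size_t n, int new_modulus)
{
    int         prev = 0;   /* largest existing modulus <= new_modulus */
    int         next = 0;   /* smallest existing modulus > new_modulus */

    if (new_modulus <= 0)
        return false;

    for (size_t i = 0; i < n; i++)
    {
        if (moduli[i] <= new_modulus)
        {
            if (moduli[i] > prev)
                prev = moduli[i];
        }
        else if (next == 0 || moduli[i] < next)
            next = moduli[i];
    }

    /* The next smaller modulus must divide the new one... */
    if (prev != 0 && new_modulus % prev != 0)
        return false;
    /* ...and the new one must divide the next larger modulus. */
    if (next != 0 && next % new_modulus != 0)
        return false;
    return true;
}
```

With existing moduli of 5, 5, 5 this accepts 10, matching the example in the comment, and rejects 7.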

Regards,
Amul


Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

27 July 2017, 14:41:01

Attaching newer patches rebased against the latest master head. Thanks!

Regards,
Amul


Attachment

Re: [HACKERS] [POC] hash partitioning

From

"yangjie@highgo.com"

Date:

28 August 2017, 10:33:46

Hello

Looking at your hash partitioning syntax, I implemented hash partitioning in a more concise way: there is no need to fix the number of sub-tables in advance, and partitions can be added dynamically.

Description

The hash partitioning implementation is based on the original range / list partitioning and uses similar syntax.

To create a partitioned table, use:

CREATE TABLE h (id int) PARTITION BY HASH(id);

The partitioning key currently supports only one value. I think the partition key could support multiple values, which may be difficult to implement for querying, but it is not impossible.

A partition can be created as below:

CREATE TABLE h1 PARTITION OF h;
CREATE TABLE h2 PARTITION OF h;
CREATE TABLE h3 PARTITION OF h;

The FOR VALUES clause cannot be used, and the partition bound is calculated automatically as a partition index, which is a single integer value.

An inserted record is stored in the partition whose index equals

DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, values[0])) % nparts /* number of partitions */;

In the above example, this is DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, id)) % 3;

postgres=# insert into h select generate_series(1,20);
INSERT 0 20
postgres=# select tableoid::regclass,* from h;
tableoid | id
----------+----
h1       |  3
h1       |  5
h1       | 17
h1       | 19
h2       |  2
h2       |  6
h2       |  7
h2       | 11
h2       | 12
h2       | 14
h2       | 15
h2       | 18
h2       | 20
h3       |  1
h3       |  4
h3       |  8
h3       |  9
h3       | 10
h3       | 13
h3       | 16
(20 rows)

Partitions can be added dynamically. When a new partition is created, the number of partitions changes, so the calculated target partition of existing rows changes as well. Because the same data must not belong to different partitions, the existing data has to be re-hashed and moved to its new target partition when a new partition is created.

postgres=# create table h4 partition of h;
CREATE TABLE
postgres=# select tableoid::regclass,* from h;
tableoid | id
----------+----
h1       |  5
h1       | 17
h1       | 19
h1       |  6
h1       | 12
h1       |  8
h1       | 13
h2       | 11
h2       | 14
h3       |  1
h3       |  9
h3       |  2
h3       | 15
h4       |  3
h4       |  7
h4       | 18
h4       | 20
h4       |  4
h4       | 10
h4       | 16
(20 rows)
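
Why existing rows move can be seen with a small sketch of the `hash % nparts` scheme. The multiplicative `toy_hash` below is an assumption made purely for illustration; it is not the type's `hash_proc`, so the partition numbers it yields will not match the output above.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy multiplicative hash; an assumption, NOT the type's hash_proc. */
uint32_t
toy_hash(uint32_t key)
{
    return key * 2654435761u;
}

/* Partition index under the "hash % nparts" scheme described above. */
uint32_t
partition_index(uint32_t key, uint32_t nparts)
{
    return toy_hash(key) % nparts;
}

/* Count keys in 1..nkeys whose target partition changes with nparts. */
size_t
count_moved(uint32_t nkeys, uint32_t old_nparts, uint32_t new_nparts)
{
    size_t      moved = 0;

    for (uint32_t k = 1; k <= nkeys; k++)
    {
        if (partition_index(k, old_nparts) != partition_index(k, new_nparts))
            moved++;
    }
    return moved;
}
```

Calling `count_moved(20, 3, 4)` reports how many of the keys 1..20 would have to be relocated when a fourth partition is added under this toy hash, which is the cost this design pays on every new partition.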

When querying the data, the hash partitioning uses the same algorithm as for insertion, and filters out the tables that do not need to be scanned.

postgres=# explain analyze select * from h where id = 1;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
Append  (cost=0.00..41.88 rows=13 width=4) (actual time=0.020..0.023 rows=1 loops=1)
   ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4) (actual time=0.013..0.016 rows=1 loops=1)
         Filter: (id = 1)
         Rows Removed by Filter: 3
Planning time: 0.346 ms
Execution time: 0.061 ms
(6 rows)

postgres=# explain analyze select * from h where id in (1,5);;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
Append  (cost=0.00..83.75 rows=52 width=4) (actual time=0.016..0.028 rows=2 loops=1)
   ->  Seq Scan on h1  (cost=0.00..41.88 rows=26 width=4) (actual time=0.015..0.018 rows=1 loops=1)
         Filter: (id = ANY ('{1,5}'::integer[]))
         Rows Removed by Filter: 6
   ->  Seq Scan on h3  (cost=0.00..41.88 rows=26 width=4) (actual time=0.005..0.007 rows=1 loops=1)
         Filter: (id = ANY ('{1,5}'::integer[]))
         Rows Removed by Filter: 3
Planning time: 0.720 ms
Execution time: 0.074 ms
(9 rows)

postgres=# explain analyze select * from h where id = 1 or id = 5;;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
Append  (cost=0.00..96.50 rows=50 width=4) (actual time=0.017..0.078 rows=2 loops=1)
   ->  Seq Scan on h1  (cost=0.00..48.25 rows=25 width=4) (actual time=0.015..0.019 rows=1 loops=1)
         Filter: ((id = 1) OR (id = 5))
         Rows Removed by Filter: 6
   ->  Seq Scan on h3  (cost=0.00..48.25 rows=25 width=4) (actual time=0.005..0.010 rows=1 loops=1)
         Filter: ((id = 1) OR (id = 5))
         Rows Removed by Filter: 3
Planning time: 0.396 ms
Execution time: 0.139 ms
(9 rows)
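
The pruning step for equality and IN-list quals can be sketched as follows: hash each constant with the same function used at insert time and scan only the partitions that come up. `sketch_hash`, `target_partition`, and `mark_partitions_to_scan` are hypothetical names, and the hash is a toy stand-in rather than the type's `hash_proc`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy hash used only for this sketch; NOT the type's hash_proc. */
uint32_t
sketch_hash(uint32_t key)
{
    return key * 2654435761u;
}

/* Same mapping as at insert time: hash % nparts. */
uint32_t
target_partition(uint32_t key, uint32_t nparts)
{
    return sketch_hash(key) % nparts;
}

/*
 * Set scan[p] = true for each partition p that can hold one of keys[].
 * scan[] must have nparts entries.  Returns the number of partitions
 * that need to be scanned.
 */
size_t
mark_partitions_to_scan(const uint32_t *keys, size_t nkeys,
                        uint32_t nparts, bool *scan)
{
    size_t      nmarked = 0;

    for (uint32_t p = 0; p < nparts; p++)
        scan[p] = false;

    for (size_t i = 0; i < nkeys; i++)
    {
        uint32_t    p = target_partition(keys[i], nparts);

        if (!scan[p])
        {
            scan[p] = true;
            nmarked++;
        }
    }
    return nmarked;
}
```

For a two-element IN list this marks at most two partitions, which corresponds to the two Seq Scan nodes in the plans above.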

Detaching, attaching, and dropping partitions are not supported.

Best regards,
young

yonj1e.github.io

yangjie@highgo.com

Re: [HACKERS] [POC] hash partitioning

From

Yugo Nagata

Date:

28 August 2017, 11:30:15

Hi young,

On Mon, 28 Aug 2017 15:33:46 +0800
"yangjie@highgo.com" <yangjie@highgo.com> wrote:

> Hello
> 
> Looking at your hash partitioning syntax, I implemented a hash partition in a more concise way, with no need to determine the number of sub-tables, and dynamically add partitions.

I think it is great work, but the current consensus on hash partitioning supports
Amul's patch[1], in which the syntax is different from my original proposal.
So, you will have to read Amul's patch and join the discussion if you still want to
propose your implementation.

Regards,

[1] https://www.postgresql.org/message-id/CAAJ_b965A2oog=6eFUhELexL3RmgFssB3G7LwkVA1bw0WUJJoA@mail.gmail.com


> 
> Description
> 
> The hash partitioning implementation is based on the original range / list partitioning and uses similar syntax.
> 
> To create a partitioned table, use:
> 
> CREATE TABLE h (id int) PARTITION BY HASH(id);
> 
> The partitioning key currently supports only one value. I think the partition key could support multiple values,
> which may be difficult to implement for querying, but it is not impossible.
> 
> A partition can be created as below:
> 
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
>  
> The FOR VALUES clause cannot be used, and the partition bound is calculated automatically as a partition index, which is a single integer value.
> 
> An inserted record is stored in a partition whose index equals 
> DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, values[0])) % nparts /* Number of partitions */
> ;
> In the above example, this is DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, id)) % 3;
> 
> postgres=# insert into h select generate_series(1,20);
> INSERT 0 20
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id 
> ----------+----
>  h1       |  3
>  h1       |  5
>  h1       | 17
>  h1       | 19
>  h2       |  2
>  h2       |  6
>  h2       |  7
>  h2       | 11
>  h2       | 12
>  h2       | 14
>  h2       | 15
>  h2       | 18
>  h2       | 20
>  h3       |  1
>  h3       |  4
>  h3       |  8
>  h3       |  9
>  h3       | 10
>  h3       | 13
>  h3       | 16
> (20 rows)
> 
> Partitions can be added dynamically. When a new partition is created, the number of partitions changes, so the calculated target partition of existing rows changes as well. Because the same data must not belong to different partitions, the existing data has to be re-hashed and moved to its new target partition when a new partition is created.
> 
> postgres=# create table h4 partition of h;
> CREATE TABLE
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id 
> ----------+----
>  h1       |  5
>  h1       | 17
>  h1       | 19
>  h1       |  6
>  h1       | 12
>  h1       |  8
>  h1       | 13
>  h2       | 11
>  h2       | 14
>  h3       |  1
>  h3       |  9
>  h3       |  2
>  h3       | 15
>  h4       |  3
>  h4       |  7
>  h4       | 18
>  h4       | 20
>  h4       |  4
>  h4       | 10
>  h4       | 16
> (20 rows)
> 
> When querying the data, the hash partitioning uses the same algorithm as for insertion, and filters out the tables that do not need to be scanned.
> 
> postgres=# explain analyze select * from h where id = 1;
>                                              QUERY PLAN                                             
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..41.88 rows=13 width=4) (actual time=0.020..0.023 rows=1 loops=1)
>    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4) (actual time=0.013..0.016 rows=1 loops=1)
>          Filter: (id = 1)
>          Rows Removed by Filter: 3
>  Planning time: 0.346 ms
>  Execution time: 0.061 ms
> (6 rows)
> 
> postgres=# explain analyze select * from h where id in (1,5);;
>                                              QUERY PLAN                                             
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..83.75 rows=52 width=4) (actual time=0.016..0.028 rows=2 loops=1)
>    ->  Seq Scan on h1  (cost=0.00..41.88 rows=26 width=4) (actual time=0.015..0.018 rows=1 loops=1)
>          Filter: (id = ANY ('{1,5}'::integer[]))
>          Rows Removed by Filter: 6
>    ->  Seq Scan on h3  (cost=0.00..41.88 rows=26 width=4) (actual time=0.005..0.007 rows=1 loops=1)
>          Filter: (id = ANY ('{1,5}'::integer[]))
>          Rows Removed by Filter: 3
>  Planning time: 0.720 ms
>  Execution time: 0.074 ms
> (9 rows)
> 
> postgres=# explain analyze select * from h where id = 1 or id = 5;;
>                                              QUERY PLAN                                             
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=50 width=4) (actual time=0.017..0.078 rows=2 loops=1)
>    ->  Seq Scan on h1  (cost=0.00..48.25 rows=25 width=4) (actual time=0.015..0.019 rows=1 loops=1)
>          Filter: ((id = 1) OR (id = 5))
>          Rows Removed by Filter: 6
>    ->  Seq Scan on h3  (cost=0.00..48.25 rows=25 width=4) (actual time=0.005..0.010 rows=1 loops=1)
>          Filter: ((id = 1) OR (id = 5))
>          Rows Removed by Filter: 3
>  Planning time: 0.396 ms
>  Execution time: 0.139 ms
> (9 rows)
> 
> Can not detach / attach / drop partition table.
> 
> Best regards,
> young
> 
> 
> yonj1e.github.io
> yangjie@highgo.com


-- 
Yugo Nagata <nagata@sraoss.co.jp>

Re: [HACKERS] [POC] hash partitioning

From

yangjie

Date:

29 August 2017, 05:19:12

Hi,

Here is my patch; I forgot to attach it earlier. It is also discussed at the following address:

https://www.postgresql.org/message-id/2017082612390093777512%40highgo.com

-------

young

HighGo Database: http://www.highgo.com

On 8/28/2017 16:28，Yugo Nagata<nagata@sraoss.co.jp> wrote：

Hi young,

On Mon, 28 Aug 2017 15:33:46 +0800
"yangjie@highgo.com" <yangjie@highgo.com> wrote:

> Hello
>
> Looking at your hash partitioning syntax, I implemented a hash partition in a more concise way, with no need to determine the number of sub-tables, and dynamically add partitions.

I think it is great work, but the current consensus about hash-partitioning supports
Amul's patch[1], in which the syntax is different from the my original proposal.
So, you will have to read Amul's patch and make a discussion if you still want to
propose your implementation.

Regards,

[1] https://www.postgresql.org/message-id/CAAJ_b965A2oog=6eFUhELexL3RmgFssB3G7LwkVA1bw0WUJJoA@mail.gmail.com

>
> Description
>
> The hash partition's implement is on the basis of the original range / list partition,and using similar syntax.
>
> To create a partitioned table ,use:
>
> CREATE TABLE h (id int) PARTITION BY HASH(id);
>
> The partitioning key supports only one value, and I think the partition key can support multiple values,
> which may be difficult to implement when querying, but it is not impossible.
>
> A partition table can be create as bellow:
>
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
>
> FOR VALUES clause cannot be used, and the partition bound is calclulated automatically as partition index of single integer value.
>
> An inserted record is stored in a partition whose index equals
> DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, values[0])) % nparts/* Number of partitions */
> ;
> In the above example, this is DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, id)) % 3;
>
> postgres=# insert into h select generate_series(1,20);
> INSERT 0 20
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id
> ----------+----
>  h1       |  3
>  h1       |  5
>  h1       | 17
>  h1       | 19
>  h2       |  2
>  h2       |  6
>  h2       |  7
>  h2       | 11
>  h2       | 12
>  h2       | 14
>  h2       | 15
>  h2       | 18
>  h2       | 20
>  h3       |  1
>  h3       |  4
>  h3       |  8
>  h3       |  9
>  h3       | 10
>  h3       | 13
>  h3       | 16
> (20 rows)
>
> The number of partitions here can be dynamically added, and if a new partition is created, the number of partitions changes, the calculated target partitions will change, and the same data is not reasonable in different partitions,So you need to re-calculate the existing data and insert the target partition when you create a new partition.
>
> postgres=# create table h4 partition of h;
> CREATE TABLE
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id
> ----------+----
>  h1       |  5
>  h1       | 17
>  h1       | 19
>  h1       |  6
>  h1       | 12
>  h1       |  8
>  h1       | 13
>  h2       | 11
>  h2       | 14
>  h3       |  1
>  h3       |  9
>  h3       |  2
>  h3       | 15
>  h4       |  3
>  h4       |  7
>  h4       | 18
>  h4       | 20
>  h4       |  4
>  h4       | 10
>  h4       | 16
> (20 rows)
>
> When querying the data, the hash partition uses the same algorithm as the insertion, and filters out the table that does not need to be scanned.
>
> postgres=# explain analyze select * from h where id = 1;
>                                              QUERY PLAN
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..41.88 rows=13 width=4) (actual time=0.020..0.023 rows=1 loops=1)
>    ->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4) (actual time=0.013..0.016 rows=1 loops=1)
>          Filter: (id = 1)
>          Rows Removed by Filter: 3
>  Planning time: 0.346 ms
>  Execution time: 0.061 ms
> (6 rows)
>
> postgres=# explain analyze select * from h where id in (1,5);;
>                                              QUERY PLAN
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..83.75 rows=52 width=4) (actual time=0.016..0.028 rows=2 loops=1)
>    ->  Seq Scan on h1  (cost=0.00..41.88 rows=26 width=4) (actual time=0.015..0.018 rows=1 loops=1)
>          Filter: (id = ANY ('{1,5}'::integer[]))
>          Rows Removed by Filter: 6
>    ->  Seq Scan on h3  (cost=0.00..41.88 rows=26 width=4) (actual time=0.005..0.007 rows=1 loops=1)
>          Filter: (id = ANY ('{1,5}'::integer[]))
>          Rows Removed by Filter: 3
>  Planning time: 0.720 ms
>  Execution time: 0.074 ms
> (9 rows)
>
> postgres=# explain analyze select * from h where id = 1 or id = 5;;
>                                              QUERY PLAN
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=50 width=4) (actual time=0.017..0.078 rows=2 loops=1)
>    ->  Seq Scan on h1  (cost=0.00..48.25 rows=25 width=4) (actual time=0.015..0.019 rows=1 loops=1)
>          Filter: ((id = 1) OR (id = 5))
>          Rows Removed by Filter: 6
>    ->  Seq Scan on h3  (cost=0.00..48.25 rows=25 width=4) (actual time=0.005..0.010 rows=1 loops=1)
>          Filter: ((id = 1) OR (id = 5))
>          Rows Removed by Filter: 3
>  Planning time: 0.396 ms
>  Execution time: 0.139 ms
> (9 rows)
>
> Detaching, attaching, or dropping a partition table is not possible.
>
> Best regards,
> young
>
>
> yonj1e.github.io
> yangjie@highgo.com
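
The quoted point about re-calculating existing rows when the partition count changes can be illustrated with a minimal Python sketch. It is not PostgreSQL code: toy_hash is a hypothetical stand-in for the real hash function, and routing is assumed to be hash % number_of_partitions as described in the quoted text.

```python
def toy_hash(key):
    # Hypothetical stand-in hash; PostgreSQL's real hash functions differ.
    return (key * 2654435761) % 2**32


def moved_keys(keys, old_count, new_count):
    """Return the keys whose target partition index changes when the
    number of partitions goes from old_count to new_count."""
    return [k for k in keys
            if toy_hash(k) % old_count != toy_hash(k) % new_count]


keys = range(1, 21)
moved = moved_keys(keys, 3, 4)
print(f"{len(moved)} of {len(keys)} rows would have to move")
```

Any key returned by moved_keys sits in the wrong partition after the count changes, which is why the existing data has to be redistributed.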

--
Yugo Nagata <nagata@sraoss.co.jp>

Attachment

hash_part_on_beta2_v1.patch

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

04 September 2017, 13:38:45

I've updated the patch to use an extended hash function (commit 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.

Regards,

Amul

On Thu, Jul 27, 2017 at 5:11 PM, amul sul <sulamul@gmail.com> wrote:

> Attaching newer patches rebased against the latest master head. Thanks !
>
> Regards,
> Amul

Attachment

Re: [HACKERS] [POC] hash partitioning

From

Rajkumar Raghuwanshi

Date:

05 September 2017, 12:13:07

On Mon, Sep 4, 2017 at 4:08 PM, amul sul <sulamul@gmail.com> wrote:

> I've updated patch to use an extended hash function (Commit # 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.

I have done some testing with these patches and everything looks fine. I am attaching the SQL and output files for reference.

Thanks & Regards,

Rajkumar Raghuwanshi

QMG, EnterpriseDB Corporation

Attachment

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

08 September 2017, 04:15:44

On Mon, Sep 4, 2017 at 6:38 AM, amul sul <sulamul@gmail.com> wrote:
> I've updated patch to use an extended hash function (Commit #
> 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.

Committed 0001 after noticing that Jeevan Ladhe also found that change
convenient for default partitioning.  I made a few minor cleanups;
hopefully I didn't break anything.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

08 September 2017, 15:40:25

On Fri, Sep 8, 2017 at 6:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Mon, Sep 4, 2017 at 6:38 AM, amul sul <sulamul@gmail.com> wrote:
>> I've updated patch to use an extended hash function (Commit #
>> 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.
>
> Committed 0001 after noticing that Jeevan Ladhe also found that change
> convenient for default partitioning. I made a few minor cleanups;
> hopefully I didn't break anything.

Thank you.

Rebased 0002 against this commit & renamed to 0001, PFA.

Regards,

Amul

Attachment

0001-hash-partitioning_another_design-v18.patch

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

11 September 2017, 11:17:22

On Fri, Sep 8, 2017 at 6:10 PM, amul sul <sulamul@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 6:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Sep 4, 2017 at 6:38 AM, amul sul <sulamul@gmail.com> wrote:
>> > I've updated patch to use an extended hash function (Commit #
>> > 81c5e46c490e2426db243eada186995da5bb0ba7) for the partitioning.
>>
>> Committed 0001 after noticing that Jeevan Ladhe also found that change
>> convenient for default partitioning.  I made a few minor cleanups;
>> hopefully I didn't break anything.
>
>
> Thanks you.
>
> Rebased 0002 against this commit & renamed to 0001, PFA.

Given that we have default partition support now, I am wondering
whether hash partitioned tables also should have default partitions.
The way we have structured hash partitioning syntax, there can be
"holes" in partitions. Default partition would help plug those holes.
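
To make the notion of "holes" concrete, here is a minimal Python sketch, not PostgreSQL code. It assumes partition bounds are (modulus, remainder) pairs and, as a simplifying assumption, that every modulus is a factor of the largest one; find_holes is a hypothetical helper name.

```python
def find_holes(bounds):
    """Return the remainders, modulo the largest modulus, that no
    partition covers. bounds is a list of (modulus, remainder) pairs."""
    if not bounds:
        return []
    largest = max(m for m, _ in bounds)
    for m, _ in bounds:
        if largest % m != 0:
            raise ValueError("assumption violated: modulus must divide the largest modulus")
    # A hash value h is covered if some partition has h % modulus == remainder.
    # Because every modulus divides `largest`, checking residues 0..largest-1 suffices.
    return [r for r in range(largest)
            if not any(r % m == rem for m, rem in bounds)]


# Three of the four remainders for modulus 4 are defined; one is missing.
print(find_holes([(4, 0), (4, 1), (4, 3)]))
```

A non-empty result means some hash values have no partition to route to, which is the gap a default partition would plug.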

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

11 September 2017, 14:43:29

On Mon, Sep 11, 2017 at 4:17 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> Rebased 0002 against this commit & renamed to 0001, PFA.
>
> Given that we have default partition support now, I am wondering
> whether hash partitioned tables also should have default partitions.
> The way we have structured hash partitioning syntax, there can be
> "holes" in partitions. Default partition would help plug those holes.

Yeah, I was thinking about that, too.  On the one hand, it seems like
it's solving the problem the wrong way: if you've set up hash
partitioning properly, you shouldn't have any holes.  On the other
hand, supporting it probably wouldn't cost anything noticeable and
might make things seem more consistent.  I'm not sure which way to
jump on this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [POC] hash partitioning

From

Alvaro Herrera

Date:

11 September 2017, 15:00:53

Robert Haas wrote:
> On Mon, Sep 11, 2017 at 4:17 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
> >> Rebased 0002 against this commit & renamed to 0001, PFA.
> >
> > Given that we have default partition support now, I am wondering
> > whether hash partitioned tables also should have default partitions.
> > The way we have structured hash partitioning syntax, there can be
> > "holes" in partitions. Default partition would help plug those holes.
> 
> Yeah, I was thinking about that, too.  On the one hand, it seems like
> it's solving the problem the wrong way: if you've set up hash
> partitioning properly, you shouldn't have any holes.  On the other
> hand, supporting it probably wouldn't cost anything noticeable and
> might make things seem more consistent.  I'm not sure which way to
> jump on this one.

How difficult/tedious/troublesome would it be to install the missing
partitions if you set hash partitioning with a default partition and
only later on notice that some partitions are missing?  I think if the
answer is that you need to exclusive-lock something for a long time and
this causes a disruption in production systems, then it's better not to
allow a default partition at all and just force all the hash partitions
to be there from the start.

On the other hand, if you can get tuples out of the default partition
into their intended regular partitions without causing any disruption,
then it seems okay to allow default partitions in hash partitioning
setups.

(I, like many others, was unable to follow the default partition stuff
as closely as I would have liked.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

11 September 2017, 17:12:57

On Mon, Sep 11, 2017 at 8:00 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> How difficult/tedious/troublesome would be to install the missing
> partitions if you set hash partitioning with a default partition and
> only later on notice that some partitions are missing?  I think if the
> answer is that you need to exclusive-lock something for a long time and
> this causes a disruption in production systems, then it's better not to
> allow a default partition at all and just force all the hash partitions
> to be there from the start.
>
> On the other hand, if you can get tuples out of the default partition
> into their intended regular partitions without causing any disruption,
> then it seems okay to allow default partitions in hash partitioning
> setups.

I think there's no real use case for default partitioning, and yeah,
you do need exclusive locks to repartition things (whether hash
partitioning or otherwise).  It would be nice to fix that eventually,
but it's hard, because the executor has to cope with the floor moving
under it, and as of today, it really can't cope with that at all - not
because of partitioning specifically, but because of existing design
decisions that will require a lot of work (and probably arguing) to
revisit.

I think the way to get around the usability issues for hash
partitioning is to eventually add some syntax that does things like
(1) automatically create the table with N properly-configured
partitions, (2) automatically split an existing partition into N
pieces, and (3) automatically rewrite the whole table using a
different partition count.
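
As a rough illustration of item (2), assuming partition bounds are (modulus, remainder) pairs as in the FOR VALUES WITH (MODULUS ..., REMAINDER ...) syntax used in this thread, the arithmetic of splitting one partition into N pieces can be sketched in Python. split_bound is a hypothetical helper, not proposed syntax, and it says nothing about how the rows themselves would be moved.

```python
def split_bound(modulus, remainder, pieces):
    """Return the (modulus, remainder) bounds that together cover exactly
    the hash values of the original bound, split into `pieces` parts."""
    if pieces < 1 or not 0 <= remainder < modulus:
        raise ValueError("invalid bound or piece count")
    new_modulus = modulus * pieces
    # h % modulus == remainder holds exactly when h % new_modulus is one of
    # remainder, remainder + modulus, ..., remainder + (pieces - 1) * modulus.
    return [(new_modulus, remainder + i * modulus) for i in range(pieces)]


print(split_bound(4, 1, 2))
```

The other partitions keep their existing bounds, so only the rows of the partition being split need to be redistributed.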

People seem to find the hash partitioning stuff a little arcane.  I
don't want to discount that confusion with some sort of high-handed "I
know better" attitude, but I think the interface that users will actually
see can end up being pretty straightforward.  The complexity that is
there in the syntax is to allow pg_upgrade and pg_dump/restore to work
properly.  But users don't necessarily have to use the same syntax
that pg_dump does, just as you can say CREATE INDEX ON a (b) and let
the system specify the index name, but at dump time the index name is
specified explicitly.

> (I, like many others, was unable to follow the default partition stuff
> as closely as I would have liked.)

Uh, sorry about that.  Would it help if I wrote a blog post on it or
something?  The general idea is simple: any tuples that don't route to
any other partition get routed to the default partition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

11 September 2017, 19:15:04

On Mon, Sep 11, 2017 at 5:30 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Robert Haas wrote:
> > On Mon, Sep 11, 2017 at 4:17 AM, Ashutosh Bapat
> > <ashutosh.bapat@enterprisedb.com> wrote:
> > >> Rebased 0002 against this commit & renamed to 0001, PFA.
> > >
> > > Given that we have default partition support now, I am wondering
> > > whether hash partitioned tables also should have default partitions.
> > > The way we have structured hash partitioning syntax, there can be
> > > "holes" in partitions. Default partition would help plug those holes.
> >
> > Yeah, I was thinking about that, too. On the one hand, it seems like
> > it's solving the problem the wrong way: if you've set up hash
> > partitioning properly, you shouldn't have any holes. On the other
> > hand, supporting it probably wouldn't cost anything noticeable and
> > might make things seem more consistent. I'm not sure which way to
> > jump on this one.
>
> How difficult/tedious/troublesome would be to install the missing
> partitions if you set hash partitioning with a default partition and
> only later on notice that some partitions are missing? I think if the
> answer is that you need to exclusive-lock something for a long time and
> this causes a disruption in production systems, then it's better not to
> allow a default partition at all and just force all the hash partitions
> to be there from the start.

I am also leaning toward not supporting a default partition for a hash
partitioned table.

The major drawback I can see is the constraint that gets created on the
default partition table. IIUC, the constraint on the default partition table
is just the negation of the partition constraints of all its sibling
partitions.

Consider a hash partitioned table having partitions with hash bounds
(modulus 64, remainder 0), ..., (modulus 64, remainder 62) and partition
columns col1, col2, and so on. The constraint for the default partition will
then be:

NOT (satisfies_hash_partition(64, 0, hash_fn1(col1), hash_fn2(col2), ...) OR ... OR
     satisfies_hash_partition(64, 62, hash_fn1(col1), hash_fn2(col2), ...))

This will hurt performance much more than with any other partitioning
strategy, because it calculates the hash for the same partitioning key
multiple times.

We could overcome this by having another SQL function (e.g.
satisfies_default_hash_partition) which calculates the hash value once and
checks the remainder, but that would be a different path from the current
default partition framework.
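
The cost difference described above can be sketched in Python. This is only an illustration under stated assumptions: CountingHash is a toy stand-in for the partition hash function, sibling_constraint loosely mirrors one per-sibling check that hashes the key itself, and in_default_single_hash plays the role of the hypothetical satisfies_default_hash_partition. None of these names exist in the patch.

```python
class CountingHash:
    """Toy stand-in hash function that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, key):
        self.calls += 1
        return (key * 31 + 7) % 2**32


def sibling_constraint(modulus, remainder, hash_fn, key):
    # Every evaluation hashes the key again, as in the per-sibling constraint.
    return hash_fn(key) % modulus == remainder


def in_default_naive(siblings, hash_fn, key):
    # Negation of the sibling constraints: one hash computation per sibling tested.
    return not any(sibling_constraint(m, r, hash_fn, key) for m, r in siblings)


def in_default_single_hash(siblings, hash_fn, key):
    # Hash once, then only compare remainders against each sibling bound.
    h = hash_fn(key)
    return not any(h % m == r for m, r in siblings)


siblings = [(64, r) for r in range(63)]  # remainder 63 is left undefined
naive, single = CountingHash(), CountingHash()
key = 8  # with this toy hash, (8 * 31 + 7) % 64 == 63, so the key falls in the hole
print(in_default_naive(siblings, naive, key), naive.calls)
print(in_default_single_hash(siblings, single, key), single.calls)
```

For a row that really belongs in the default partition, the naive form has to evaluate every sibling term, so the number of hash computations grows with the number of partitions, while the single-hash form stays at one.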

Regards,

Amul

Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

13 September 2017, 17:13:40

Hi Amul,

On 09/08/2017 08:40 AM, amul sul wrote:
> Rebased 0002 against this commit & renamed to 0001, PFA.
> 

This patch needs a rebase.

Best regards, Jesper




Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

14 September 2017, 11:58:49

On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

> Hi Amul,
>
> On 09/08/2017 08:40 AM, amul sul wrote:
>> Rebased 0002 against this commit & renamed to 0001, PFA.
>
> This patch needs a rebase.

Thanks for your note.

Attached is the patch rebased on the latest master head.

I also added an error on creating a default partition for a hash partitioned
table, and updated the documentation and test script for the same.

Regards,

Amul

Attachment

0001-hash-partitioning_another_design-v19.patch

Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

14 September 2017, 18:39:26

Hi Amul,

On 09/14/2017 04:58 AM, amul sul wrote:
> On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>> This patch needs a rebase.
>>
>>
> Thanks for your note.
>  
> Attached is the patch rebased on the latest master head.
> Also added error on creating default partition for the hash partitioned table,
> and updated document & test script for the same.
> 

Thanks !

When I do

CREATE TABLE mytab (
  a integer NOT NULL,
  b integer NOT NULL,
  c integer,
  d integer
) PARTITION BY HASH (b);

and create 64 partitions;

CREATE TABLE mytab_p00 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, 
REMAINDER 0);
...
CREATE TABLE mytab_p63 PARTITION OF mytab FOR VALUES WITH (MODULUS 64, 
REMAINDER 63);

and associated indexes

CREATE INDEX idx_p00 ON mytab_p00 USING btree (b, a);
...
CREATE INDEX idx_p63 ON mytab_p63 USING btree (b, a);

Populate the database, and do ANALYZE.

Given

EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT a, b, c, d FROM mytab 
WHERE b = 42

gives

Append
  -> Index Scan using idx_p00 (cost rows=7) (actual rows=0)
  ...
  -> Index Scan using idx_p63 (cost rows=7) (actual rows=0)

I.e., all partitions are being scanned. Of course, only one partition will 
contain the rows I'm looking for.
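
A minimal Python sketch of why a single partition should be enough for an equality predicate on the partition key. It is not how the planner is implemented: toy_hash is a hypothetical stand-in for the real hash function, and partitions_to_scan is an illustrative helper name.

```python
def toy_hash(key):
    # Hypothetical stand-in hash; PostgreSQL's real hash functions differ.
    return (key * 2654435761) % 2**32


def partitions_to_scan(partitions, key):
    """partitions maps a partition name to its (modulus, remainder) bound.
    Return the partitions whose bound matches the hash of `key`."""
    h = toy_hash(key)
    return [name for name, (m, r) in partitions.items() if h % m == r]


# 64 partitions named like the example above: mytab_p00 .. mytab_p63.
partitions = {f"mytab_p{r:02d}": (64, r) for r in range(64)}
print(partitions_to_scan(partitions, 42))
```

With all 64 remainders defined, every key maps to exactly one partition, so the other 63 index scans in the plan above cannot return rows.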

Best regards, Jesper


Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

14 September 2017, 19:05:16

On Thu, Sep 14, 2017 at 11:39 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> When I do
>
> CREATE TABLE mytab (
>   a integer NOT NULL,
>   b integer NOT NULL,
>   c integer,
>   d integer
> ) PARTITION BY HASH (b);
>
> and create 64 partitions;
>
> CREATE TABLE mytab_p00 PARTITION OF mytab FOR VALUES WITH (MODULUS 64,
> REMAINDER 0);
> ...
> CREATE TABLE mytab_p63 PARTITION OF mytab FOR VALUES WITH (MODULUS 64,
> REMAINDER 63);
>
> and associated indexes
>
> CREATE INDEX idx_p00 ON mytab_p00 USING btree (b, a);
> ...
> CREATE INDEX idx_p63 ON mytab_p63 USING btree (b, a);
>
> Populate the database, and do ANALYZE.
>
> Given
>
> EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT a, b, c, d FROM mytab WHERE b
> = 42
>
> gives
>
> Append
>   -> Index Scan using idx_p00 (cost rows=7) (actual rows=0)
>   ...
>   -> Index Scan using idx_p63 (cost rows=7) (actual rows=0)
>
> E.g. all partitions are being scanned. Of course one partition will contain
> the rows I'm looking for.

Yeah, we need Amit Langote's work in
http://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
to land and this patch to be adapted to make use of it.  I think
that's the major thing still standing in the way of this. Concerns
were also raised about not having a way to seed the hash function, but
we fixed that in 81c5e46c490e2426db243eada186995da5bb0ba7 and
hopefully this patch has been updated to use a seed (I haven't looked
yet).  And there was a concern about hash functions not being
portable, but the conclusion of that was basically that most people
think --load-via-partition-root will be a satisfactory workaround for
cases where that becomes a problem (cf. commit
23d7680d04b958de327be96ffdde8f024140d50e).  So this is the major
remaining issue that I know about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

14 September 2017, 19:53:12

Hi,

On 09/14/2017 12:05 PM, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 11:39 AM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> When I do
>>
>> CREATE TABLE mytab (
>>    a integer NOT NULL,
>>    b integer NOT NULL,
>>    c integer,
>>    d integer
>> ) PARTITION BY HASH (b);
>>
>> and create 64 partitions;
>>
>> CREATE TABLE mytab_p00 PARTITION OF mytab FOR VALUES WITH (MODULUS 64,
>> REMAINDER 0);
>> ...
>> CREATE TABLE mytab_p63 PARTITION OF mytab FOR VALUES WITH (MODULUS 64,
>> REMAINDER 63);
>>
>> and associated indexes
>>
>> CREATE INDEX idx_p00 ON mytab_p00 USING btree (b, a);
>> ...
>> CREATE INDEX idx_p63 ON mytab_p63 USING btree (b, a);
>>
>> Populate the database, and do ANALYZE.
>>
>> Given
>>
>> EXPLAIN (ANALYZE, VERBOSE, BUFFERS ON) SELECT a, b, c, d FROM mytab WHERE b
>> = 42
>>
>> gives
>>
>> Append
>>    -> Index Scan using idx_p00 (cost rows=7) (actual rows=0)
>>    ...
>>    -> Index Scan using idx_p63 (cost rows=7) (actual rows=0)
>>
>> E.g. all partitions are being scanned. Of course one partition will contain
>> the rows I'm looking for.
> 
> Yeah, we need Amit Langote's work in
> http://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
> to land and this patch to be adapted to make use of it.  I think
> that's the major thing still standing in the way of this. Concerns
> were also raised about not having a way to see the hash function, but
> we fixed that in 81c5e46c490e2426db243eada186995da5bb0ba7 and
> hopefully this patch has been updated to use a seed (I haven't looked
> yet).  And there was a concern about hash functions not being
> portable, but the conclusion of that was basically that most people
> think --load-via-partition-root will be a satisfactory workaround for
> cases where that becomes a problem (cf. commit
> 23d7680d04b958de327be96ffdde8f024140d50e).  So this is the major
> remaining issue that I know about.
> 

Thanks for the information, Robert !

Best regards, Jesper



Re: [HACKERS] [POC] hash partitioning

From

David Fetter

Date:

14 September 2017, 19:54:44

On Mon, Sep 11, 2017 at 07:43:29AM -0400, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 4:17 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
> >> Rebased 0002 against this commit & renamed to 0001, PFA.
> >
> > Given that we have default partition support now, I am wondering
> > whether hash partitioned tables also should have default
> > partitions.  The way we have structured hash partitioning syntax,
> > there can be "holes" in partitions. Default partition would help
> > plug those holes.
> 
> Yeah, I was thinking about that, too.  On the one hand, it seems
> like it's solving the problem the wrong way: if you've set up hash
> partitioning properly, you shouldn't have any holes.

Should we be pointing the gun away from people's feet by making hash
partitions that cover the space automagically when the partitioning
scheme[1] is specified?  In other words, do we have a good reason to have
only some of the hash partitions so defined by default?

Best,
David.

[1] For now, that's just the modulus, but the PoC included specifying
hashing functions, so I assume other ways to specify the partitioning
scheme could eventually be proposed.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

14 September 2017, 19:56:57

On Thu, Sep 14, 2017 at 12:54 PM, David Fetter <david@fetter.org> wrote:
> Should we be pointing the gun away from people's feet by making hash
> partitions that cover the space automagically when the partitioning
> scheme[1] is specified?  In other words, do we have a good reason to have
> only some of the hash partitions so defined by default?

Sure, we can add some convenience syntax for that, but I'd like to get
the basic stuff working before doing that kind of polishing.

If nothing else, I assume Keith Fiske's pg_partman will provide a way
to magically DTRT about an hour after this goes in.  But probably we
can do better in core easily enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

14 September 2017, 20:07:11

On 09/14/2017 12:56 PM, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 12:54 PM, David Fetter <david@fetter.org> wrote:
>> Should we be pointing the gun away from people's feet by making hash
>> partitions that cover the space automagically when the partitioning
>> scheme[1] is specified?  In other words, do we have a good reason to have
>> only some of the hash partitions so defined by default?
> 
> Sure, we can add some convenience syntax for that, but I'd like to get
> the basic stuff working before doing that kind of polishing.
> 
> If nothing else, I assume Keith Fiske's pg_partman will provide a way
> to magically DTRT about an hour after this goes in.  But probably we
> can do better in core easily enough.
> 

Yeah, it would be nice to have a syntax like

) PARTITION BY HASH (col) WITH (AUTO_CREATE = 64);

But then there also needs to be a way to create the 64 associated 
indexes too for everything to be easy.

Best regards, Jesper



Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

14 September 2017, 20:52:27

On Thu, Sep 14, 2017 at 1:07 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Yeah, it would be nice to have a syntax like
>
> ) PARTITION BY HASH (col) WITH (AUTO_CREATE = 64);
>
> But then there also needs to be a way to create the 64 associated indexes
> too for everything to be easy.

Well, for that, there's this proposal:

http://postgr.es/m/c8fe4f6b-ff46-aae0-89e3-e936a35f0cfd@postgrespro.ru

As several people have rightly pointed out, there's a lot of work to be
done on partitioning to get it to where we want it to be.  Even in
v10, it's got significant benefits, such as much faster bulk-loading,
but I don't hear anybody disputing the notion that a lot more work is
needed.  The good news is that a lot of that work is already in
progress; the bad news is that a lot of that work is not done yet.

But I think that's OK.  We can't solve every problem at once, and I
think we're moving things along here at a reasonably brisk pace.  That
didn't stop me from complaining bitterly to someone just yesterday
that we aren't moving faster still, but unfortunately EnterpriseDB has
only been able to get 12 developers to do any work at all on
partitioning this release cycle, and 3 of those have so far helped
only with review and benchmarking.  It's a pity we can't do more, but
considering how many community projects are 1-person efforts I think
it's pretty good.

To be clear, I know you're not (or at least I assume you're not)
trying to beat me up about this, just raising a concern, and I'm not
trying to beat you up either, just letting you know that it is definitely
on the radar screen but not there yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

14 September 2017, 21:05:02

On 09/14/2017 01:52 PM, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 1:07 PM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Yeah, it would be nice to have a syntax like
>>
>> ) PARTITION BY HASH (col) WITH (AUTO_CREATE = 64);
>>
>> But then there also needs to be a way to create the 64 associated indexes
>> too for everything to be easy.
> 
> Well, for that, there's this proposal:
> 
> http://postgr.es/m/c8fe4f6b-ff46-aae0-89e3-e936a35f0cfd@postgrespro.ru
> 
> As several people have right pointed out, there's a lot of work to be
> done on partitioning it to get it to where we want it to be.  Even in
> v10, it's got significant benefits, such as much faster bulk-loading,
> but I don't hear anybody disputing the notion that a lot more work is
> needed.  The good news is that a lot of that work is already in
> progress; the bad news is that a lot of that work is not done yet.
> 
> But I think that's OK.  We can't solve every problem at once, and I
> think we're moving things along here at a reasonably brisk pace.  That
> didn't stop me from complaining bitterly to someone just yesterday
> that we aren't moving faster still, but unfortunately EnterpriseDB has
> only been able to get 12 developers to do any work at all on
> partitioning this release cycle, and 3 of those have so far helped
> only with review and benchmarking.  It's a pity we can't do more, but
> considering how many community projects are 1-person efforts I think
> it's pretty good.
> 
> To be clear, I know you're not (or at least I assume you're not)
> trying to beat me up about this, just raising a concern, and I'm not
> trying to beat you up either, just let you know that it is definitely
> on the radar screen but not there yet.
> 

Definitely not a complaint about the work being done.

I think the scope of Amul's and others' work on hash partition support is 
where it needs to be. Improvements can always follow in future releases.

My point was that it is easy to script the definition of the partitions and 
their associated indexes, so it is more important to focus on the core 
functionality with the developer / review resources available.
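
As an illustration of how little scripting is involved, here is a minimal Python sketch that emits the partition and index definitions in the FOR VALUES WITH (MODULUS ..., REMAINDER ...) form used earlier in this thread. hash_partition_ddl and the naming scheme (mytab_p00, idx_p00) are assumptions for the example, and identifiers are not quoted, so it assumes trusted, simple names.

```python
def hash_partition_ddl(table, index_cols, modulus):
    """Yield CREATE TABLE and CREATE INDEX statements for every remainder
    of the given modulus."""
    width = len(str(modulus - 1))  # zero-pad suffixes: p00 .. p63 for modulus 64
    for remainder in range(modulus):
        suffix = f"p{remainder:0{width}d}"
        part = f"{table}_{suffix}"
        yield (f"CREATE TABLE {part} PARTITION OF {table} "
               f"FOR VALUES WITH (MODULUS {modulus}, REMAINDER {remainder});")
        yield (f"CREATE INDEX idx_{suffix} ON {part} "
               f"USING btree ({', '.join(index_cols)});")


for statement in hash_partition_ddl("mytab", ["b", "a"], 64):
    print(statement)
```

The output can be piped into psql once the parent table exists.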

However, it is a little bit difficult to follow the dependencies between 
different partition patches, so I may not always provide sane feedback, 
as seen in [1].

[1] 
https://www.postgresql.org/message-id/579077fd-8f07-aff7-39bc-b92c855cdb70%40redhat.com

Best regards, Jesper



Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

14 September 2017, 22:47:50

On Thu, Sep 14, 2017 at 2:05 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> However, it is a little bit difficult to follow the dependencies between
> different partition patches, so I may not always provide sane feedback, as
> seen in [1].
>
> [1]
> https://www.postgresql.org/message-id/579077fd-8f07-aff7-39bc-b92c855cdb70%40redhat.com

Yeah, no issues.  I knew about the dependency between those patches,
but I'm pretty sure there wasn't any terribly explicit discussion
about it, even if the issue probably came up parenthetically someplace
or other.  Oops.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] hash partitioning

From

Thom Brown

Date:

15 September 2017, 02:00:04

On 14 September 2017 at 09:58, amul sul <sulamul@gmail.com> wrote:
> On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>>
>> Hi Amul,
>>
>> On 09/08/2017 08:40 AM, amul sul wrote:
>>>
>>> Rebased 0002 against this commit & renamed to 0001, PFA.
>>>
>>
>> This patch needs a rebase.
>>
>
> Thanks for your note.
> Attached is the patch rebased on the latest master head.
> Also added error on creating default partition for the hash partitioned
> table, and updated document & test script for the same.

Sorry, but this needs another rebase as it's broken by commit
77b6b5e9ceca04dbd6f0f6cd3fc881519acc8714.

Thom



Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

15 September 2017, 09:30:09

On Fri, Sep 15, 2017 at 4:30 AM, Thom Brown <thom@linux.com> wrote:

> On 14 September 2017 at 09:58, amul sul <sulamul@gmail.com> wrote:
>> On Wed, Sep 13, 2017 at 7:43 PM, Jesper Pedersen
>> <jesper.pedersen@redhat.com> wrote:
>>>
>>> Hi Amul,
>>>
>>> On 09/08/2017 08:40 AM, amul sul wrote:
>>>>
>>>> Rebased 0002 against this commit & renamed to 0001, PFA.
>>>>
>>>
>>> This patch needs a rebase.
>>>
>>
>> Thanks for your note.
>> Attached is the patch rebased on the latest master head.
>> Also added error on creating default partition for the hash partitioned
>> table, and updated document & test script for the same.
>
> Sorry, but this needs another rebase as it's broken by commit
> 77b6b5e9ceca04dbd6f0f6cd3fc881519acc8714.

Attached rebased patch, thanks.

Regards,

Amul

Attachment

0001-hash-partitioning_another_design-v20.patch

Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

18 September 2017, 18:25:02

On 09/15/2017 02:30 AM, amul sul wrote:
> Attached rebased patch, thanks.
> 

While reading through the patch I thought it would be better to keep 
MODULUS and REMAINDER in caps wherever CREATE TABLE is in caps too, in 
order to highlight that these are "keywords" for hash partitioning.

Also updated some of the documentation.

V20 patch passes make check-world, and my testing (typical 64 
partitions, and various ATTACH/DETACH scenarios).

Thanks for working on this !

Best regards,
  Jesper


Attachment

delta_v20_v1.patch

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

27 September 2017, 10:05:37

On Mon, Sep 18, 2017 at 8:55 PM, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

On 09/15/2017 02:30 AM, amul sul wrote:
Attached rebased patch, thanks.

While reading through the patch I thought it would be better to keep MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to highlight that these are "keywords" for hash partition.

Also updated some of the documentation.

Thanks a lot for the patch, included in the attached version.

V20 patch passes make check-world, and my testing (typical 64 partitions, and various ATTACH/DETACH scenarios).

Nice, thanks again.

Regards,

Amul

Attachment

0001-hash-partitioning_another_design-v21.patch

Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

27 September 2017, 16:41:26

On 09/27/2017 03:05 AM, amul sul wrote:
>>> Attached rebased patch, thanks.
>>>
>>>
>> While reading through the patch I thought it would be better to keep
>> MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to
>> highlight that these are "keywords" for hash partition.
>>
>> Also updated some of the documentation.
>>
>>
> Thanks a lot for the patch, included in the attached version.
> 

Thank you.

Based on [1] I have moved the patch to "Ready for Committer".

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoYsw3pusDen4_A44c7od%2BbEAST0eYo%2BjODtyofR0W2soQ%40mail.gmail.com

Best regards, Jesper




Re: [HACKERS] [POC] hash partitioning

From

Amit Langote

Date:

28 September 2017, 08:54:59

On 2017/09/27 22:41, Jesper Pedersen wrote:
> On 09/27/2017 03:05 AM, amul sul wrote:
>>>> Attached rebased patch, thanks.
>>>>
>>>>
>>> While reading through the patch I thought it would be better to keep
>>> MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to
>>> highlight that these are "keywords" for hash partition.
>>>
>>> Also updated some of the documentation.
>>>
>>>
>> Thanks a lot for the patch, included in the attached version.
>>
> 
> Thank you.
> 
> Based on [1] I have moved the patch to "Ready for Committer".

Thanks a lot Amul for working on this.  Like Jesper said, the patch looks
pretty good overall.  I was looking at the latest version with intent to
study certain things about hash partitioning the way patch implements it,
during which I noticed some things.

+      The modulus must be a positive integer, and the remainder must a

must be a

+      suppose you have a hash-partitioned table with 8 children, each of which
+      has modulus 8, but find it necessary to increase the number of partitions
+      to 16.

Might it be a good idea to say 8 "partitions" instead of "children" in the
first sentence?

+      each modulus-8 partition until none remain.  While this may still involve
+      a large amount of data movement at each step, it is still better than
+      having to create a whole new table and move all the data at once.
+     </para>
+

I read the paragraph that ends with the above text and started wondering
if the procedure to redistribute data in hash partitions by detaching and
attaching with new modulus/remainder could be illustrated with an example.
Maybe in the examples section of the ALTER TABLE page?
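
The redistribution idea in the quoted documentation rests on simple modular arithmetic: a row whose hash value has remainder r modulo 8 must have remainder r or r + 8 modulo 16. The following is only a minimal sketch of that arithmetic; partition_remainder() and split_is_consistent() are hypothetical helper names, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the remainder that selects a partition for a hash. */
static uint64_t
partition_remainder(uint64_t hash, uint64_t modulus)
{
    assert(modulus > 0);
    return hash % modulus;
}

/*
 * Returns 1 if every hash value in [0, limit) that falls in the partition
 * (old_modulus, r) falls in (2 * old_modulus, r) or
 * (2 * old_modulus, r + old_modulus) after the modulus is doubled.
 */
static int
split_is_consistent(uint64_t old_modulus, uint64_t limit)
{
    for (uint64_t h = 0; h < limit; h++)
    {
        uint64_t r_old = partition_remainder(h, old_modulus);
        uint64_t r_new = partition_remainder(h, 2 * old_modulus);

        if (r_new != r_old && r_new != r_old + old_modulus)
            return 0;
    }
    return 1;
}
```

This is why one modulus-8 partition can be detached and replaced by two modulus-16 partitions without touching rows in the other seven.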

+      Since hash operator class provide only equality, not ordering, collation

Either "Since hash operator classes provide" or "Since hash operator class
provides"

Other than the above points, patch looks good.

By the way, I noticed a couple of things about hash partition constraints:

1. In get_qual_for_hash(), using
get_fn_expr_rettype(&key->partsupfunc[i]), which returns InvalidOid for
the lack of fn_expr being set to non-NULL value, causes funcrettype of the
FuncExpr being generated for hashing partition key columns to be set to
InvalidOid, which I think is wrong.  That is, the following if condition
in get_fn_expr_rettype() is always satisfied:

    if (!flinfo || !flinfo->fn_expr)
        return InvalidOid;

I think we could use get_func_rettype(key->partsupfunc[i].fn_oid)
instead.  Attached patch
hash-v21-set-funcexpr-funcrettype-correctly.patch, which applies on top of
v21 of your patch.

2. It seems that the reason constraint exclusion doesn't work with hash
partitions as implemented by the patch is that predtest.c:
operator_predicate_proof() returns false even without looking into the
hash partition constraint, which is of the following form:

satisfies_hash_partition(<mod>, <rem>, <key1-exthash>,..)

because the above constraint expression doesn't translate into a binary
opclause (an OpExpr), which operator_predicate_proof() knows how to work
with.  So, false is returned at the beginning of that function by the
following code:

    if (!is_opclause(predicate))
        return false;

For example,

create table p (a int) partition by hash (a);
create table p0 partition of p for values with (modulus 4, remainder 0);
create table p1 partition of p for values with (modulus 4, remainder 1);
\d+ p0
<...>
Partition constraint: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint))

-- both p0 and p1 scanned
explain select * from p where satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint));
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=1700 width=4)
   ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
         Filter: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint))
   ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
         Filter: satisfies_hash_partition(4, 0, hashint4extended(a, '8816678312871386367'::bigint))
(5 rows)

-- both p0 and p1 scanned
explain select * from p where satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint));
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Append  (cost=0.00..96.50 rows=1700 width=4)
   ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
         Filter: satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint))
   ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
         Filter: satisfies_hash_partition(4, 1, hashint4extended(a, '8816678312871386367'::bigint))
(5 rows)

I looked into how satisfies_hash_partition() works and came up with an
idea that I think will make constraint exclusion work.  What if we emitted
the hash partition constraint in the following form instead:

hash_partition_mod(hash_partition_hash(key1-exthash, key2-exthash),
                   <mod>) = <rem>

With that form, constraint exclusion seems to work as illustrated below:

\d+ p0
<...>
Partition constraint: (hash_partition_modulus(hash_partition_hash(hashint4extended(a, '8816678312871386367'::bigint)), 4) = 0)

-- note only p0 is scanned
explain select * from p where hash_partition_modulus(hash_partition_hash(hashint4extended(a, '8816678312871386367'::bigint)), 4) = 0;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..61.00 rows=13 width=4)
   ->  Seq Scan on p0  (cost=0.00..61.00 rows=13 width=4)
         Filter: (hash_partition_modulus(hash_partition_hash(hashint4extended(a, '8816678312871386367'::bigint)), 4) = 0)
(3 rows)

-- note only p1 is scanned
explain select * from p where hash_partition_modulus(hash_partition_hash(hashint4extended(a, '8816678312871386367'::bigint)), 4) = 1;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..61.00 rows=13 width=4)
   ->  Seq Scan on p1  (cost=0.00..61.00 rows=13 width=4)
         Filter: (hash_partition_modulus(hash_partition_hash(hashint4extended(a, '8816678312871386367'::bigint)), 4) = 1)
(3 rows)

I tried to implement that in the attached
hash-v21-hash-part-constraint.patch, which applies on top of v21 of your
patch (actually on top of
hash-v21-set-funcexpr-funcrettype-correctly.patch, which I think should be
applied anyway as it fixes a bug in the original patch).

What do you think?  Eventually, the new partition-pruning method [1] will
make using constraint exclusion obsolete, but it might be a good idea to
have it working if we can.

Thanks,
Amit

[1] https://commitfest.postgresql.org/14/1272/


Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

28 September 2017, 12:56:48

On Thu, Sep 28, 2017 at 11:24 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/09/27 22:41, Jesper Pedersen wrote:
>> On 09/27/2017 03:05 AM, amul sul wrote:
>>>>> Attached rebased patch, thanks.
>>>>>
>>>>>
>>>> While reading through the patch I thought it would be better to keep
>>>> MODULUS and REMAINDER in caps, if CREATE TABLE was in caps too in order to
>>>> highlight that these are "keywords" for hash partition.
>>>>
>>>> Also updated some of the documentation.
>>>>
>>>>
>>> Thanks a lot for the patch, included in the attached version.
>>>
>>
>> Thank you.
>>
>> Based on [1] I have moved the patch to "Ready for Committer".
>
> Thanks a lot Amul for working on this.  Like Jesper said, the patch looks
> pretty good overall.  I was looking at the latest version with intent to
> study certain things about hash partitioning the way patch implements it,
> during which I noticed some things.
>

Thanks Amit for looking at the patch.

> +      The modulus must be a positive integer, and the remainder must a
>
> must be a
>

Fixed in the attached version.

> +      suppose you have a hash-partitioned table with 8 children, each of
> which
> +      has modulus 8, but find it necessary to increase the number of
> partitions
> +      to 16.
>

Fixed in the attached version.

> Might it be a good idea to say 8 "partitions" instead of "children" in the
> first sentence?
>
> +      each modulus-8 partition until none remain.  While this may still
> involve
> +      a large amount of data movement at each step, it is still better than
> +      having to create a whole new table and move all the data at once.
> +     </para>
> +
>

Fixed in the attached version.

> I read the paragraph that ends with the above text and started wondering
> if the example to redistribute data in hash partitions by detaching and
> attaching with new modulus/remainder could be illustrated with an example?
> Maybe in the examples section of the ALTER TABLE page?
>

I think the hint in the documentation is more than enough. There are any
number of ways to redistribute data, and the documentation is not meant to
explain all of them.

> +      Since hash operator class provide only equality, not ordering,
> collation
>
> Either "Since hash operator classes provide" or "Since hash operator class
> provides"
>

Fixed in the attached version.

> Other than the above points, patch looks good.
>
>
> By the way, I noticed a couple of things about hash partition constraints:
>
> 1. In get_qual_for_hash(), using
> get_fn_expr_rettype(&key->partsupfunc[i]), which returns InvalidOid for
> the lack of fn_expr being set to non-NULL value, causes funcrettype of the
> FuncExpr being generated for hashing partition key columns to be set to
> InvalidOid, which I think is wrong.  That is, the following if condition
> in get_fn_expr_rettype() is always satisfied:
>
>     if (!flinfo || !flinfo->fn_expr)
>         return InvalidOid;
>
> I think we could use get_func_rettype(key->partsupfunc[i].fn_oid)
> instead.  Attached patch
> hash-v21-set-funcexpr-funcrettype-correctly.patch, which applies on top
> v21 of your patch.
>

Thanks for the patch, included in the attached version.

> 2. It seems that the reason constraint exclusion doesn't work with hash
> partitions as implemented by the patch is that predtest.c:
> operator_predicate_proof() returns false even without looking into the
> hash partition constraint, which is of the following form:
>
> satisfies_hash_partition(<mod>, <rem>, <key1-exthash>,..)
>
> because the above constraint expression doesn't translate into a binary
> opclause (an OpExpr), which operator_predicate_proof() knows how to work
> with.  So, false is returned at the beginning of that function by the
> following code:
>
>     if (!is_opclause(predicate))
>         return false;
>
> For example,
>
> create table p (a int) partition by hash (a);
> create table p0 partition of p for values with (modulus 4, remainder 0);
> create table p1 partition of p for values with (modulus 4, remainder 1);
> \d+ p0
> <...>
> Partition constraint: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
>
> -- both p0 and p1 scanned
> explain select * from p where satisfies_hash_partition(4, 0,
> hashint4extended(a, '8816678312871386367'::bigint));
>                                              QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=1700 width=4)
>    ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
>          Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
>    ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
>          Filter: satisfies_hash_partition(4, 0, hashint4extended(a,
> '8816678312871386367'::bigint))
> (5 rows)
>
> -- both p0 and p1 scanned
> explain select * from p where satisfies_hash_partition(4, 1,
> hashint4extended(a, '8816678312871386367'::bigint));
>                                              QUERY PLAN
>
> ----------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..96.50 rows=1700 width=4)
>    ->  Seq Scan on p0  (cost=0.00..48.25 rows=850 width=4)
>          Filter: satisfies_hash_partition(4, 1, hashint4extended(a,
> '8816678312871386367'::bigint))
>    ->  Seq Scan on p1  (cost=0.00..48.25 rows=850 width=4)
>          Filter: satisfies_hash_partition(4, 1, hashint4extended(a,
> '8816678312871386367'::bigint))
> (5 rows)
>
>
> I looked into how satisfies_hash_partition() works and came up with an
> idea that I think will make constraint exclusion work.  What if we emitted
> the hash partition constraint in the following form instead:
>
> hash_partition_mod(hash_partition_hash(key1-exthash, key2-exthash),
>                    <mod>) = <rem>
>
> With that form, constraint exclusion seems to work as illustrated below:
>
> \d+ p0
> <...>
> Partition constraint:
> (hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 0)
>
> -- note only p0 is scanned
> explain select * from p where
> hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 0;
>                      QUERY PLAN
>
>
--------------------------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..61.00 rows=13 width=4)
>    ->  Seq Scan on p0  (cost=0.00..61.00 rows=13 width=4)
>          Filter:
> (hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 0)
> (3 rows)
>
> -- note only p1 is scanned
> explain select * from p where
> hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 1;
>                                                         QUERY PLAN
>
>
--------------------------------------------------------------------------------------------------------------------------
>  Append  (cost=0.00..61.00 rows=13 width=4)
>    ->  Seq Scan on p1  (cost=0.00..61.00 rows=13 width=4)
>          Filter:
> (hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 1)
> (3 rows)
>
> I tried to implement that in the attached
> hash-v21-hash-part-constraint.patch, which applies on top v21 of your
> patch (actually on top of
> hash-v21-set-funcexpr-funcrettype-correctly.patch, which I think should be
> applied anyway as it fixes a bug of the original patch).
>
> What do you think?  Eventually, the new partition-pruning method [1] will
> make using constraint exclusion obsolete, but it might be a good idea to
> have it working if we can.
>

It does not really do partition pruning via constraint exclusion, and I don't
think anyone is going to use the remainder in the WHERE condition to fetch
data; hash partitioning is not meant for that.

But I am sure that we could solve this problem using your and Beena's work
toward faster partition pruning [1] and runtime partition pruning [2].

I will think about these changes if they are required for the pruning feature.

Regards,
Amul

[1] https://postgr.es/m/098b9c71-1915-1a2a-8d52-1a7a50ce79e8@lab.ntt.co.jp
[2] https://postgr.es/m/CAOG9ApE16ac-_VVZVvv0gePSgkg_BwYEV1NBqZFqDR2bBE0X0A@mail.gmail.com


Attachment

0001-hash-partitioning_another_design-v22.patch

Re: [HACKERS] [POC] hash partitioning

From

Robert Haas

Date:

29 September 2017, 19:53:39

On Thu, Sep 28, 2017 at 1:54 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I looked into how satisfies_hash_partition() works and came up with an
> idea that I think will make constraint exclusion work.  What if we emitted
> the hash partition constraint in the following form instead:
>
> hash_partition_mod(hash_partition_hash(key1-exthash, key2-exthash),
>                    <mod>) = <rem>
>
> With that form, constraint exclusion seems to work as illustrated below:
>
> \d+ p0
> <...>
> Partition constraint:
> (hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 0)
>
> -- note only p0 is scanned
> explain select * from p where
> hash_partition_modulus(hash_partition_hash(hashint4extended(a,
> '8816678312871386367'::bigint)), 4) = 0;

What we actually want constraint exclusion to cover is SELECT * FROM p
WHERE a = 525600;

As Amul says, nobody's going to enter a query in the form you have it
here.  Life is too short to take time to put queries into bizarre
forms.
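
To make the target concrete, here is a rough sketch of what pruning for an equality qual such as a = 525600 amounts to: hash the constant, then keep only the partition whose (modulus, remainder) pair matches. Everything here is an assumption for illustration; toy_hash() is a splitmix64-style stand-in rather than PostgreSQL's hashint4extended(), and HashBound and prune_for_equality() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 64-bit mixer; NOT PostgreSQL's hashint4extended(). */
static uint64_t
toy_hash(int32_t key, uint64_t seed)
{
    uint64_t x = (uint64_t) (uint32_t) key + seed;

    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Hypothetical description of one hash partition's bound. */
typedef struct
{
    uint64_t modulus;
    uint64_t remainder;
} HashBound;

/*
 * Returns the position of the only partition that can contain "key",
 * or -1 if no existing partition accepts its hash value.
 */
static int
prune_for_equality(const HashBound *bounds, int nparts,
                   int32_t key, uint64_t seed)
{
    uint64_t h = toy_hash(key, seed);

    for (int i = 0; i < nparts; i++)
    {
        assert(bounds[i].modulus > 0);
        if (h % bounds[i].modulus == bounds[i].remainder)
            return i;
    }
    return -1;
}
```

With non-overlapping bounds at most one partition can match, so a single-key lookup would need to scan one partition only.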

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [POC] hash partitioning

From

Jesper Pedersen

Date:

06 October 2017, 15:05:51

Hi Amul,

On 09/28/2017 05:56 AM, amul sul wrote:
> It does not really do the partition pruning via constraint exclusion and I don't
> think anyone is going to use the remainder in the where condition to fetch
> data and hash partitioning is not meant for that.
> 
> But I am sure that we could solve this problem using your and Beena's work
> toward faster partition pruning[1] and Runtime Partition Pruning[2].
> 
> Will think on this changes if it is required for the pruning feature.
> 

Could you rebase on latest master ?

Best regards, Jesper



Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

07 October 2017, 14:52:42

On Fri, Oct 6, 2017 at 5:35 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
> Hi Amul,
>
> Could you rebase on latest master ?
>

Sure, I will post that soon, but first I need to test hash partitioning
with the recent partition-wise join commit (f49842d1ee). Thanks.

Regards,
Amul


Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

09 October 2017, 14:14:29

On Sat, Oct 7, 2017 at 5:22 PM, amul sul <sulamul@gmail.com> wrote:
> On Fri, Oct 6, 2017 at 5:35 PM, Jesper Pedersen
> <jesper.pedersen@redhat.com> wrote:
>> Hi Amul,
>>
>> Could you rebase on latest master ?
>>
>
> Sure will post that soon, but before that, I need to test hash partitioning
> with recent partition-wise join commit (f49842d1ee), thanks.
>

Updated patch attached.

0001 is a rebase of the previous patch, with no new changes.
0002 makes a few changes in the partition-wise join code to support
hash-partitioned tables as well, and adds regression tests.

Thanks & Regards,
Amul


Attachment

Re: [HACKERS] [POC] hash partitioning

From

Ashutosh Bapat

Date:

09 October 2017, 15:21:21

On Mon, Oct 9, 2017 at 4:44 PM, amul sul <sulamul@gmail.com> wrote:


> 0002 few changes in partition-wise join code to support
> hash-partitioned table as well & regression tests.

+    switch (key->strategy)
+    {
+        case PARTITION_STRATEGY_HASH:
+            /*
+             * Indexes array is same as the greatest modulus.
+             * See partition_bounds_equal() for more explanation.
+             */
+            num_indexes = DatumGetInt32(src->datums[ndatums - 1][0]);
+            break;
This logic is duplicated at multiple places.  I think it's time we consolidate
these changes in a function/macro and call it from the places where we have to
calculate number of indexes based on the information in partition descriptor.
Refactoring existing code might be a separate patch and then add hash
partitioning case in hash partitioning patch.

+        int        dim = hash_part? 2 : partnatts;
Call the variable as natts_per_datum or just natts?

+                                    hash_part? true : key->parttypbyval[j],
+                                    key->parttyplen[j]);
parttyplen is the length of partition key attribute, whereas what you want here
is the length of type of modulus and remainder. Is that correct? Probably we
need some special handling wherever parttyplen and parttypbyval is used e.g. in
call to partition_bounds_equal() from build_joinrel_partition_info().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

10 October 2017, 13:02:55

On Mon, Oct 9, 2017 at 5:51 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Oct 9, 2017 at 4:44 PM, amul sul <sulamul@gmail.com> wrote:
>

Thanks Ashutosh for your review, please find my comment inline.

>
>> 0002 few changes in partition-wise join code to support
>> hash-partitioned table as well & regression tests.
>
> +    switch (key->strategy)
> +    {
> +        case PARTITION_STRATEGY_HASH:
> +            /*
> +             * Indexes array is same as the greatest modulus.
> +             * See partition_bounds_equal() for more explanation.
> +             */
> +            num_indexes = DatumGetInt32(src->datums[ndatums - 1][0]);
> +            break;
> This logic is duplicated at multiple places.  I think it's time we consolidate
> these changes in a function/macro and call it from the places where we have to
> calculate number of indexes based on the information in partition descriptor.
> Refactoring existing code might be a separate patch and then add hash
> partitioning case in hash partitioning patch.
>

Makes sense. I added get_partition_bound_num_indexes() in 0001 to get the
number of index elements, and get_greatest_modulus() in 0002 which, as the
name suggests, gets the greatest modulus of the hash partition bound.

> +        int        dim = hash_part? 2 : partnatts;
> Call the variable as natts_per_datum or just natts?
>

natts represents the number of attributes, but for the hash partition bound we
are not dealing with attributes, which is why I used a short form of
"dimension". Thoughts?

> +                                    hash_part? true : key->parttypbyval[j],
> +                                    key->parttyplen[j]);
> parttyplen is the length of partition key attribute, whereas what you want here
> is the length of type of modulus and remainder. Is that correct? Probably we
> need some special handling wherever parttyplen and parttypbyval is used e.g. in
> call to partition_bounds_equal() from build_joinrel_partition_info().
>

Unless I am missing something, I don't think we should worry about parttyplen,
because datumCopy() ignores typlen when the datatype is pass-by-value.

Regards,
Amul


On Tue, Oct 10, 2017 at 3:42 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> On Tue, Oct 10, 2017 at 3:32 PM, amul sul <sulamul@gmail.com> wrote:
>
>>> +                                    hash_part? true : key->parttypbyval[j],
>>> +                                    key->parttyplen[j]);
>>> parttyplen is the length of partition key attribute, whereas what you want here
>>> is the length of type of modulus and remainder. Is that correct? Probably we
>>> need some special handling wherever parttyplen and parttypbyval is used e.g. in
>>> call to partition_bounds_equal() from build_joinrel_partition_info().
>>>
>>
>> Unless I am missing something, I don't think we should worry about parttyplen
>> because in the datumCopy() when the datatype is pass-by-value then typelen
>> is ignored.
>
> That's true, but it's ugly, passing typbyvalue of one type and len of other.
>

How about the attached patch(0003)?
Also, the dim variable is renamed to natts.

Regards,
Amul


On Fri, Oct 13, 2017 at 3:00 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-10-12 17:27:52 -0400, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 4:20 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> In other words, it's not utterly fixed in stone --- we invented
>> >> --load-via-partition-root primarily to cope with circumstances that
>> >> could change hash values --- but we sure don't want to be changing it
>> >> with any regularity, or for a less-than-excellent reason.
>> >
>> > Yea, that's what I expected. It'd probably good for somebody to run
>> > smhasher or such on the output of the combine function (or even better,
>> > on both the 32 and 64 bit variants) in that case.
>>
>> Not sure how that test suite works exactly, but presumably the
>> characteristics in practice will depend the behavior of the hash
>> functions used as input the combine function - so the behavior could
>> be good for an (int, int) key but bad for a (text, date) key, or
>> whatever.
>
> I don't think that's true, unless you have really bad hash functions on
> the the component hashes. A hash combine function can't really do
> anything about badly hashed input, what you want is that it doesn't
> *reduce* the quality of the hash by combining.
>

I tried to get test results from the suggested SMHasher [1] for the 32-bit
and 64-bit versions of hash_combine.

SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to
N=255, using 256-N as the seed.  To test hash_combine we need two hash
values to combine, so I generated 64-bit and 128-bit hashes using the
cityhash functions [2] for the given SMHasher key, then split each in two
parts to test the 32-bit and 64-bit hash_combine functions respectively.
Attached are the patch with the SMHasher code changes and the output of the
32-bit and 64-bit hash_combine testing.  Note that I have skipped the speed
test, which is irrelevant here.

Comparing with the results of other hash functions [3], we can see that the
hash_combine test results are not bad either.

Do let me know if the current testing is not good enough or if you want me to
do more testing, thanks.

[1] https://github.com/aappleby/smhasher
[2] https://github.com/aappleby/smhasher/blob/master/src/CityTest.cpp
[3] https://github.com/rurban/smhasher/tree/master/doc

Regards,
Amul


Attachment

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

24 October 2017, 14:21:27

On Thu, Oct 12, 2017 at 6:38 PM, amul sul <sulamul@gmail.com> wrote:
> On Thu, Oct 12, 2017 at 6:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Oct 10, 2017 at 7:07 AM, amul sul <sulamul@gmail.com> wrote:
>>> How about the attached patch(0003)?
>>> Also, the dim variable is renamed to natts.
>>
>> I'm not sure I believe this comment:
>>
>> +        /*
>> +         * We arrange the partitions in the ascending order of their modulus
>> +         * and remainders.  Also every modulus is factor of next larger
>> +         * modulus.  This means that the index of a given partition is same as
>> +         * the remainder of that partition.  Also entries at (remainder + N *
>> +         * modulus) positions in indexes array are all same for (modulus,
>> +         * remainder) specification for any partition.  Thus datums array from
>> +         * both the given bounds are same, if and only if their indexes array
>> +         * will be same.  So, it suffices to compare indexes array.
>> +         */
>>
>> I am particularly not sure that I believe that the index of a
>> partition must be the same as the remainder.  It doesn't seem like
>> that would be true when there is more than one modulus or when some
>> partitions are missing.
>>
>
> Looks like an explanation by the comment is not good enough, will think on this.
>
> Here are the links for the previous discussion:
> 1] https://postgr.es/m/CAFjFpRfHqSGBjNgJV2p%2BC4Yr5Qxvwygdsg4G_VQ6q9NTB-i3MA%40mail.gmail.com
> 2] https://postgr.es/m/CAFjFpRdeESKFkVGgmOdYvmD3d56-58c5VCBK0zDRjHrkq_VcNg%40mail.gmail.com
>
I have modified the comment a little bit; now let me explain the theory behind it.

The rd_partdesc->boundinfo->indexes array stores, at each position, the index
in the rd_partdesc->oids array of the partition that falls at that position.
A partition's positions in the indexes array are
remainder + N * modulus_of_that_partition (where N = 0,1,2,...).

When all partitions have the same modulus, the remainders are 0,1,2,...
and each partition appears once, at position 0,1,2,... (N = 0).

When there is more than one modulus, the index of a partition in the oids
array can be stored at multiple positions in the indexes array if its modulus
is less than the greatest modulus among the bounds (N = 0,1,2,...).

For example, take partition bounds (modulus, remainder) = p1(2,0), p2(4,1),
p3(8,3), p4(8,7).  The oids array is [p1,p2,p3,p4], sorted by modulus and then
by remainder, and the indexes array is [0, 1, 0, 2, 0, 1, 0, 3].  The size of
the indexes array is the greatest modulus.

In other words, if a partition's index in the oids array is stored multiple
times in the indexes array, then the smallest difference between those
positions is the modulus of that partition.  In the above case, the index of
partition p1 is stored at positions 0,2,4,6. The lowest position is the
remainder and the minimum difference is the modulus of p1.
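
As a rough check of that claim, the bound of a partition can be read back from the indexes array alone. The function below is a hypothetical sketch (recover_bound is not a name from the patch); it assumes that a partition whose index appears only once has the greatest modulus, which follows from each modulus being a factor of the greatest one.

```c
#include <assert.h>

/*
 * For partition index `part`, scan indexes[0 .. n-1], where n is the
 * greatest modulus.  The first position holding `part` is its remainder;
 * the smallest gap between consecutive positions holding `part` is its
 * modulus.  If `part` appears only once, its modulus is n.
 * Returns 0 if `part` does not appear at all, 1 otherwise.
 */
static int
recover_bound(const int *indexes, int n, int part,
              int *modulus, int *remainder)
{
    int first = -1;
    int prev = -1;
    int min_gap = n;

    assert(n > 0);
    for (int pos = 0; pos < n; pos++)
    {
        if (indexes[pos] != part)
            continue;
        if (first < 0)
            first = pos;
        else if (pos - prev < min_gap)
            min_gap = pos - prev;
        prev = pos;
    }

    if (first < 0)
        return 0;
    *remainder = first;
    *modulus = min_gap;
    return 1;
}
```

Applied to an indexes array in which index 0 sits at positions 0,2,4,6, this yields modulus 2 and remainder 0, matching p1(2,0).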

Since the indexes arrays of both bounds are the same, the positions at which a
given oids-array index falls are the same for both bounds. One could argue that
two different moduli could have the same remainder position, but that is not
allowed: it would cause a partition overlap error at creation, and we also have
a restriction that each modulus in the hash partition bound must be a factor of
the next larger modulus.
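
The restriction on moduli mentioned above can be written as a small check. This is a sketch only; moduli_are_compatible is a hypothetical helper, and it assumes the moduli are supplied in ascending order.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * moduli[] must be sorted in ascending order.  Returns true if every
 * modulus is a factor of the next one in the list.
 */
static bool
moduli_are_compatible(const int *moduli, int n)
{
    for (int i = 0; i + 1 < n; i++)
    {
        assert(moduli[i] > 0);
        if (moduli[i + 1] % moduli[i] != 0)
            return false;
    }
    return true;
}
```

The moduli 2, 4, 8, 8 from the example pass this check, while a pair such as 4 and 6 does not.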

> [....]
>
>> +static uint64
>> +mix_hash_value(int nkeys, Datum *hash_array, bool *isnull)
>>
> How about combining high 32 bits and the low 32 bits separately as shown below?
>
> static inline uint64
> hash_combine64(uint64 a, uint64 b)
> {
>     return (((uint64) hash_combine((uint32) (a >> 32), (uint32) (b >> 32)) << 32)
>             | hash_combine((uint32) a, (uint32) b));
> }
>
I have used the hash_combine64 function suggested by Andres [1].
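
For comparison, the split-halves idea quoted above can be written out as a standalone sketch. This is not the function used in the patch. combine32 is a boost-style 32-bit combiner standing in for hash_combine, which is an assumption here, and the casts are applied after the shift so that the high halves are actually extracted.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in 32-bit combiner (boost-style); an assumption for this sketch. */
static inline uint32_t
combine32(uint32_t a, uint32_t b)
{
    a ^= b + 0x9e3779b9u + (a << 6) + (a >> 2);
    return a;
}

/* Combine the high and low 32-bit halves separately, then reassemble. */
static inline uint64_t
combine64_by_halves(uint64_t a, uint64_t b)
{
    uint32_t hi = combine32((uint32_t) (a >> 32), (uint32_t) (b >> 32));
    uint32_t lo = combine32((uint32_t) a, (uint32_t) b);

    return ((uint64_t) hi << 32) | lo;
}
```

One property of this variant is that the high and low halves never influence each other, which may be one reason to prefer a combiner that mixes all 64 bits at once.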

>[....]
>> Have you checked how well the tests you've added cover the code you've
>> added?  What code is not covered by the tests, and is there any way to
>> cover it?
>>
> Will try to get gcov report for this patch.
>
The tests in the attached patch cover almost all of the code, except for a few lines [2].

Updated patch attached.

1] https://postgr.es/m/20171012194353.3nealiykmjura4bi%40alap3.anarazel.de
2] Refer gcov_output.txt attachment.

Regards,
Amul Sul

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Sun, Oct 29, 2017 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 24, 2017 at 1:21 PM, amul sul <sulamul@gmail.com> wrote:
>> Updated patch attached.
>
> This patch needs a rebase.

Sure, thanks a lot for your review.

>
> It appears that satisfies_hash_func is declared incorrectly in
> pg_proc.h.  ProcedureCreate seems to think that provariadic should be
> ANYOID if the type of the last element is ANYOID, ANYELEMENTOID if the
> type of the last element is ANYARRAYOID, and otherwise the element
> type corresponding to the array type.   But here you have the last
> element as int4[] but provariadic is any.

Actually, int4[] is also an inappropriate type, since we have started using a
64-bit hash function.  We need something like int8[], which is not available,
so I have used ANYARRAYOID in the attached patch (0004).

> I wrote the following query
> to detect problems of this type, and I think we might want to just go
> ahead and add this to the regression test suite, verifying that it
> returns no rows:
>
> select oid::regprocedure, provariadic::regtype, proargtypes::regtype[]
> from pg_proc where provariadic != 0
> and case proargtypes[array_length(proargtypes, 1)-1]
>     when 2276 then 2276 -- any -> any
>     when 2277 then 2283 -- anyarray -> anyelement
>     else (select t.oid from pg_type t
>           where t.typarray = proargtypes[array_length(proargtypes, 1)-1]) end
>     != provariadic;
>

Added in 0001 patch.
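
The rule that the query checks can also be stated as a small function. This is a standalone sketch, not catalog code: expected_provariadic and array_element_type are hypothetical names, the three pseudo-type OIDs are the ones used in the query above, and the array lookup covers only int4[] and int8[] for illustration.

```c
#include <assert.h>

typedef unsigned int Oid;

#define ANYOID          2276
#define ANYARRAYOID     2277
#define ANYELEMENTOID   2283

/* A tiny hypothetical array-type -> element-type table. */
static Oid
array_element_type(Oid array_type)
{
    switch (array_type)
    {
        case 1007:
            return 23;          /* int4[] -> int4 */
        case 1016:
            return 20;          /* int8[] -> int8 */
        default:
            return 0;           /* not covered by this sketch */
    }
}

/* Expected provariadic for a variadic function's last argument type. */
static Oid
expected_provariadic(Oid last_arg_type)
{
    if (last_arg_type == ANYOID)
        return ANYOID;          /* any -> any */
    if (last_arg_type == ANYARRAYOID)
        return ANYELEMENTOID;   /* anyarray -> anyelement */
    return array_element_type(last_arg_type);
}
```

Under this rule a last argument of int4[] expects provariadic = int4, so declaring it as "any" is the mismatch the query reports.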

> The simple fix is change provariadic to int4 and call it good.  It's
> tempting to go the other way and actually make it
> satisfies_hash_partition(int4, int4, variadic "any"), passing the
> column values directly and letting satisfies_hash_partition do the
> hashing itself.  Any arguments that had a partition key type different
> from the column type would have a RelabelType node placed on top of
> the column, so that get_fn_expr_argtype would return the partition key
> type.  Then, the function could look up the hash function for that
> type and call it directly on the value.  That way, we'd be doing only
> one function call instead of many, and the partition constraint would
> look nicer in \d+ output, too.  :-)  On the other hand, that would
> also mean that we'd have to look up the extended hash function every
> time through this function, though maybe that could be prevented by
> using fn_extra to cache FmgrInfos for all the hash functions on the
> first time through.  I'm not sure how that would compare in terms of
> speed with what you have now, but maybe it's worth trying.
>

One advantage of the current implementation is that we can see which hash
function is used for each partitioning column, and we don't need to
worry about user-specified opclasses or differing input types.

I tried something similar in my initial patch version [1], but I missed
handling a user-specified opclass for each partitioning column.  Do you want me
to handle opclasses using a RelabelType node? I am afraid that would make
the \d+ output even more horrible than the current one if a non-default opclass
is used.

> The second paragraph of the CREATE TABLE documentation for PARTITION
> OF needs to be updated like this: "The form with <literal>IN</literal>
> is used for list partitioning, the form with <literal>FROM</literal>
> and <literal>TO</literal> is used for range partitioning, and the form
> with <literal>WITH</literal> is used for hash partitioning."
>

Fixed in the attached version(0004).

> The CREATE TABLE documentation says "When using range partitioning,
> the partition key can include multiple columns or expressions (up to
> 32,"; this should be changed to say "When using range or hash
> partitioning".
>

Fixed in the attached version(0004).

> -      expression.  If no B-tree operator class is specified when creating a
> -      partitioned table, the default B-tree operator class for the datatype will
> -      be used.  If there is none, an error will be reported.
> +      expression.  If no operator class is specified when creating a partitioned
> +      table, the default operator class of the appropriate type (btree for list
> +      and range partitioning, hash for hash partitioning) will be used.  If
> +      there is none, an error will be reported.
> +     </para>
> +
> +     <para>
> +      Since hash operator class provides only equality, not ordering, collation
> +      is not relevant for hash partitioning. The behaviour will be unaffected
> +      if a collation is specified.
> +     </para>
> +
> +     <para>
> +      Hash partitioning will use support function 2 routines from the operator
> +      class. If there is none, an error will be reported.  See <xref
> +      linkend="xindex-support"> for details of operator class support
> +      functions.
>
> I think we should rework this a little more heavily.  I suggest the
> following, starting after "a single column or expression":
>
> <para>
> Range and list partitioning require a btree operator class, while hash
> partitioning requires a hash operator class.  If no operator class is
> specified explicitly, the default operator class of the appropriate
> type will be used; if no default operator class exists, an error will
> be raised.  When hash partitioning is used, the operator class used
> must implement support function 2 (see <xref linkend="xindex-support">
> for details).
> </para>
>

Thanks again, added in the attached version(0004).

> I think we can leave out the part about collations.  It's possibly
> worth a longer explanation here at some point: for range partitioning,
> collation can affect which rows go into which partitions; for list
> partitioning, it can't, but it can affect the order in which
> partitions are expanded (which is a can of worms I'm not quite ready
> to try to explain in user-facing documentation); for hash
> partitioning, it makes no difference at all.  Although at some point
> we may want to document this, I think it's a job for a separate patch,
> since (1) the existing documentation doesn't document the precise
> import of collations on existing partitioning types and (2) I'm not
> sure that CREATE TABLE is really the best place to explain this.
>

Okay.

> The example commands for creating a hash-partitioned table are missing
> spaces between WITH and the parenthesis which follows.
>

Fixed in the attached version(0004).

> In 0003, the changes to partition_bounds_copy claim that I shouldn't
> worry about the fact that typlen is set to 4 because datumCopy won't
> use it for a pass-by-value datatype, but I think that calling
> functions with incorrect arguments and hoping that they ignore them
> and therefore nothing bad happens doesn't sound like a very good idea.
> Fortunately, I think the actual code is fine; I think we just need to
> change the comments.  For hash partitioning, the datums array always
> contains two integers, which are of type int4, which is indeed a
> pass-by-value type of length 4 (note that if we were using int8 for
> the modulus and remainder, we'd need to set byval to FLOAT8PASSBYVAL).
> I would just write this as:
>
> if (hash_part)
> {
>     typlen = sizeof(int32); /* always int4 */
>     byval = true;           /* int4 is pass-by-value */
> }
>

Fixed in the attached version (now patch number is 0005).

> +       for (i = 0; i < nkeys; i++)
> +       {
> +               if (!isnull[i])
> +                       rowHash = hash_combine64(rowHash, DatumGetUInt64(hash_array[i]));
> +       }
>
> Excess braces.
>

Fixed in the attached version(0004).

> I think it might be better to inline the logic in mix_hash_value()
> into each of the two callers.  Then, the callers wouldn't need Datum
> hash_array[PARTITION_MAX_KEYS]; they could just fold each new hash
> value into a uint64 value.  That seems likely to be slightly faster
> and I don't see any real downside.
>

Fixed in the attached version(0004).
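
The inlined form suggested above might look roughly like the following standalone sketch. It is not the patch's code: compute_row_hash and toy_hash are hypothetical, and the constants in combine64 are an assumption about what a 64-bit combiner could look like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 64-bit combiner; the constants are an assumption for this sketch. */
static inline uint64_t
combine64(uint64_t a, uint64_t b)
{
    a ^= b + UINT64_C(0x49a0f4dd15e5a8e3) + (a << 54) + (a >> 7);
    return a;
}

/* Hypothetical per-key hash, standing in for an extended hash function. */
static uint64_t
toy_hash(int64_t value)
{
    return (uint64_t) value * UINT64_C(0x9e3779b97f4a7c15);
}

/*
 * Fold each non-null key's hash straight into rowHash, so no intermediate
 * array of per-key hash values is needed.  NULL keys are skipped.
 */
static uint64_t
compute_row_hash(int nkeys, const int64_t *values, const bool *isnull)
{
    uint64_t rowHash = 0;

    for (int i = 0; i < nkeys; i++)
    {
        if (isnull[i])
            continue;
        rowHash = combine64(rowHash, toy_hash(values[i]));
    }
    return rowHash;
}
```

A row whose keys are all NULL hashes to 0 in this sketch, since nothing is ever folded in.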

> rhaas=# create table natch (a citext, b text) partition by hash (a);
> ERROR:  XX000: missing support function 2(16398,16398) in opfamily 16437
> LOCATION:  RelationBuildPartitionKey, relcache.c:954
>
> It shouldn't be possible to reach an elog() from SQL, and this is not
> a friendly error message.
>

How about the error message in the attached patch (0004)?


1] https://postgr.es/m/CAAJ_b96AQBAxSQ2mxnTmx9zXh79GdP_dQWv0aupjcmz+jpiGjw@mail.gmail.com

Regards,
Amul


On Tue, Oct 31, 2017 at 10:17 AM, amul sul <sulamul@gmail.com> wrote:
> On Tue, Oct 31, 2017 at 9:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Oct 30, 2017 at 5:52 PM, amul sul <sulamul@gmail.com> wrote:
>>> Actually, int4[] is also inappropriate type as we have started using a 64bit
>>> hash function.  We need something int8[] which is not available, so that I
>>> have used ANYARRAYOID in the attached patch(0004).
>>
>> I don't know why you think int8[] is not available.
>>
>> rhaas=# select 'int8[]'::regtype;
>>  regtype
>> ----------
>>  bigint[]
>> (1 row)
>>
>
> I missed _int8, was searching for INT8ARRAYOID in pg_type.h, my bad.
>

Fixed in the 0003 patch.

>>>>[....]
>>> Something similar I've tried in my initial patch version[1], but I have missed
>>> user specified opclass handling for each partitioning column.  Do you want me
>>> to handle opclass using RelabelType node? I am afraid that, that would make
>>> the \d+ output more horrible than the current one if non-default opclass used.
>>
>> Maybe we should just pass the OID of the partition (or both the
>> partition and the parent, so we can get the lock ordering right?)
>> instead.
>>
> Okay, will try this.
>

In 0005, I rewrote satisfies_hash_partition to accept the parent's OID, then
the modulus and remainder as before, and then the column values directly. The
function opens the parent relation to get its PartitionKey, which holds the
extended hash function information in its partsupfunc array, and uses that to
calculate a hash for the partition key.  It also copies the partsupfunc array
into the function's memory context, so that we don't need to open the parent
relation again on each subsequent call to get the extended hash function
information (e.g. during a bulk insert).
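
The caching described above can be illustrated with a standalone sketch. Everything here is hypothetical (HashPartCache, lookup_parent_hash_funcs, get_cache); the `extra` pointer plays the role of a per-call-site slot such as fn_extra, and the sketch does not try to answer what happens if the parent changes while the cache is in use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t (*HashFn) (int64_t value, uint64_t seed);

/* Hypothetical cache of per-column hash functions for one parent. */
typedef struct HashPartCache
{
    unsigned int parent_id;
    int          nkeys;
    HashFn       funcs[4];
} HashPartCache;

/* Counts how often the expensive lookup runs. */
static int expensive_lookups = 0;

static uint64_t
toy_hash(int64_t value, uint64_t seed)
{
    return ((uint64_t) value ^ seed) * UINT64_C(0x9e3779b97f4a7c15);
}

/* Stand-in for opening the parent and reading its hash support functions. */
static void
lookup_parent_hash_funcs(unsigned int parent_id, HashPartCache *cache)
{
    expensive_lookups++;
    cache->parent_id = parent_id;
    cache->nkeys = 1;
    cache->funcs[0] = toy_hash;
}

/* Fill the cache on the first call only; later calls reuse it. */
static HashPartCache *
get_cache(void **extra, unsigned int parent_id)
{
    HashPartCache *cache = *extra;

    if (cache == NULL)
    {
        cache = malloc(sizeof(*cache));
        assert(cache != NULL);
        lookup_parent_hash_funcs(parent_id, cache);
        *extra = cache;
    }
    return cache;
}
```

With this shape, a bulk load that calls the function once per row pays for the lookup only on the first row.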

With this, the partition constraint shown by \d+ will be:
satisfies_hash_partition('16384'::oid, 2, 0, a, b)
where 16384 is the parent relid, 2 is the modulus, 0 is the remainder, and
'a' and 'b' are the partition columns.
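
The test that such a constraint stands for can be sketched as follows, on the understanding that a row belongs to a partition when its row hash, taken modulo the partition's modulus, equals the partition's remainder. satisfies_hash_bound is a hypothetical name, and the row hash is passed in directly rather than computed from column values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a row with this hash belongs to the (modulus, remainder) partition. */
static bool
satisfies_hash_bound(int modulus, int remainder, uint64_t row_hash)
{
    assert(modulus > 0);
    assert(remainder >= 0 && remainder < modulus);
    return row_hash % (uint64_t) modulus == (uint64_t) remainder;
}
```

For the constraint with modulus 2 and remainder 0, every row with an even row hash passes and every row with an odd row hash fails.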

In the earlier version (i.e. without the 0005 patch) the partition constraint was:
satisfies_hash_partition(2, 0,
                         hashint4extended(a, '8816678312871386365'::bigint),
                         hashtextextended(b, '8816678312871386365'::bigint))


I did a small performance test, using a COPY command to load 100,000,000
records and a separate INSERT command per record to load 200,000 records. The
results are as follows:

+---------+-----------------+--------------------+
| Command | With 0005 patch | Without 0005 patch |
+---------+-----------------+--------------------+
| COPY    | 63.719 seconds  | 64.925 seconds     |
+---------+-----------------+--------------------+
| INSERT  | 179.21 seconds  | 174.89 seconds     |
+---------+-----------------+--------------------+

Although the partition constraint becomes simpler, there isn't any performance
gain with the 0005 patch. Also, I am a little skeptical about the logic in 0005
where we copy the extended hash function info from the partition key: what if
the parent is changed while we are using it? Do we need to keep a lock on the
parent until commit in satisfies_hash_partition?

Regards,
Amul


On Wed, Nov 1, 2017 at 6:16 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the 0003 patch.

I have committed this patch set with the attached adjustments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

hash-adjustments.patch

Re: [HACKERS] [POC] hash partitioning

From

amul sul

Date:

10 November 2017, 08:38:05

On Fri, Nov 10, 2017 at 4:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 1, 2017 at 6:16 AM, amul sul <sulamul@gmail.com> wrote:
>> Fixed in the 0003 patch.
>
> I have committed this patch set with the attached adjustments.
>

Thanks a lot for your support, and a ton of thanks to all the reviewers.

Regards,
Amul

