Re: [HACKERS] path toward faster partition pruning - Mailing list pgsql-hackers

From David Rowley
Subject Re: [HACKERS] path toward faster partition pruning
Date
Msg-id CAKJS1f-4a7+BMGrmv+jzNVFOZ2HhOJmSNx-jOeKX-_bK82hi9g@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] path toward faster partition pruning  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: [HACKERS] path toward faster partition pruning
List pgsql-hackers
On 19 February 2018 at 22:19, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Attached updated patches.  Thanks again!

Thanks for making those changes.  I've made another pass of v28 and
have a few more comments.

The patch is starting to look good, but there are some new changes in
recent versions which still don't look quite right.

1. This does not fully make sense:

/*
 * Remove the indexes of partitions excluded due to each of
 * those partitions' *all* of allowed datums appearing in
 * keys->ne_datums, that is compared to the partition key
 * using <> operator.
 */

 "each of those partitions' *all* of allowed" is not correct.

Maybe better to write:

/*
 * Remove the indexes of any partitions which cannot possibly
 * contain rows matching the clauses due to key->ne_datums containing
 * all datum values which are allowed in the given partition.  This
 * is only possible to do in LIST partitioning as it's the only
 * partitioning strategy which allows the specification of exact values.
 */

2. Mine does not, but some compilers may complain about
get_partitions_for_keys() result variable being uninitalised in
get_partitions_for_keys. Probably the easiest fix would be to just
assign to NULL in the default case.

3. Did you mean to put this Assert() inside the loop?

memset(keyisnull, false, sizeof(keyisnull));
i = -1;
while ((i = bms_next_member(keys->keyisnull, i)) >= 0)
{
keys->n_eqkeys++;
keyisnull[i] = true;
}
Assert(i < partnatts);

i will always be -2 at the end of the loop. Seems like a useless
Assert in its current location.

4. Can you add a comment here to say: "Note: LIST partitioning only
supports a single partition key, therefore this function requires no
looping over the partition keys"

/*
 * get_partitions_for_keys_list
 * Return partitions of a list partitioned table for requested keys
 *
 * This interprets the keys and looks up partitions in the partition bound
 * descriptor using the list partitioning semantics.
 */

5. The following comment contains a bit of duplication to the comment
which comes after it. Maybe the following:

/*
* If the query is looking for null keys, there can only be one such
* partition.  Return the same if one exists.
*/

can be changed to:

/* Handle clauses requesting a NULL valued partition key */

6. The following comment does not quite make sense:

/* Exactly matching datum exists. */

Probably better to write:

/* An exact matching datum exists. */

7. "If it's not equal (<)" I think you mean (>), not (<), in:

 * The bound at minoff is <= minkeys, given the way
 * partition_list_bsearch() works.  If it's not equal (<), then
 * increment minoff to make it point to the datum on the right
 * that necessarily satisfies minkeys.  Also do the same if it is
 * equal but minkeys is exclusive.

However, the comment is a bit clumsy. Maybe the following is better?

/*
 * partition_list_bsearch returning a positive number means that
 * minkeys[0] must be greater than or equal to the smallest datum.
 * If we didn't find an exact matching datum (!is_equal) or if the
 * operator used was non-inclusive (>), then in both of these
 * cases we're not interested in the datum pointed to by minoff,
 * but we may start getting matches in the partition which the
 * next datum belongs to, so point to that one instead.  (This may
 * be beyond the last datum in the array, but we'll detect that
 * later.)
 */

8. The following comment could be improved:

* minkeys is greater than the datums of all non-default partitions,
* meaning there isn't one to return.  Return the default partition if
* one exists.

how about:

* The value of minkeys[0] is greater than all of the datums we have
* partitions for.  The only possible partition that could contain a
* match is the default partition.  Return that, if it exists.

9. The following could also be improved:

* The bound at maxoff is <= maxkeys, given the way
* partition_list_bsearch works.  If the bound at maxoff exactly
* matches maxkey (is_equal), but the maxkey is exclusive, then
* decrement maxoff to point to the bound on the left.

how about:

* partition_list_bsearch returning a positive number means that
* maxkeys[0] must be greater than or equal to the smallest datum.
* If the match found is an equal match, but the operator used is
* non-inclusive of that value (<), then the partition belonging
* to maxoff cannot match, so we'll decrement maxoff to point to
* the partition belonging to the previous datum.  We might end up
* decrementing maxoff down to -1, but we'll handle that later.

10. Can you append "  This may not technically be true for some data
types (e.g. integer types), however, we currently lack any sort of
infrastructure to provide us with proofs that would allow us to do
anything smarter here." to:

* For range queries, always include the default list partition,
* because list partitions divide the key space in a discontinuous
* manner, not all values in the given range will have a partition
* assigned.

11. get_partitions_for_keys_range seems to prefer to do "minoff -= 1",
but get_partitions_for_keys_list likes to "minoff--", can this be made
the same? Personally, I like -- over -= 1 as it's shorter. Although I
do remember having an argument with my university professor about
this. He claimed -= 1 was clearer... I'm still unsure what he found so
confusing about -- ...

12. The following code could be optimised a little for the case when
there's no default:

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
if (boundinfo->indexes[i] < 0)
{
include_def = true;
break;
}
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
include_def = true;

if (include_def && partition_bound_has_default(boundinfo))
result = bms_add_member(result, boundinfo->default_index);

return result;

Maybe something more like:

if (!partition_bound_has_default(boundinfo))
    return result;

/*
* There may exist a range of values unassigned to any non-default
* partition between the datums at minoff and maxoff.
*/
for (i = minoff; i <= maxoff; i++)
{
    if (boundinfo->indexes[i] < 0)
        return bms_add_member(result, boundinfo->default_index);
}

/*
* Since partition keys with nulls are mapped to the default range
* partition, we must include the default partition if some keys
* *could* be null.
*/
if (bms_num_members(keys->keyisnotnull) < partnatts)
    return bms_add_member(result, boundinfo->default_index);

Which is saves a bit of needless work when there's no default to add,
and also saves a few lines, including the line where you declare the
include_def variable.

13. Variable name:

Bitmapset  *partitioned_rels_bms = NULL;

This should likely be called partitioned_relids, and be of type Relids
instead of Bitmapset.

14. This line removal seems surplus. It should probably be fixed
independently of this patch.

  parent_roots[appinfo->child_relid] = subroot;
-
  continue;

15. I'm unsure how safe the following code is:

while ((parent_rti = bms_first_member(partitioned_rels_bms)) >= 0)
    partitioned_rels = lappend_int(partitioned_rels, parent_rti);

You're now putting this list in ascending order of relid, but some
code in set_plan_refs assumes the root partition is the first element:

root->glob->rootResultRelations =
lappend_int(root->glob->rootResultRelations,
linitial_int(splan->partitioned_rels));

By luck, the first element might today be the root due to the way we
expand the inheritance hierarchy, but I think this code is wrong to
rely on that.

I'm not really a fan of having the root partition be the first element
in the List. I would much rather see a Relids type and a special Index
field for the root, but that might be more changes that you'd like to
make here. I just don't think what you have now is correct.

16. This should probably return Relids rather than Bitmapset *.

Bitmapset *
prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

Please update the mention of Bitmapset in the comment at the top of
the function too.

17. hmm, my patch did palloc(), not palloc0(). My original patch was
broken and missed this, but v2 got it.

context->clauseinfo = partclauseinfo =
palloc0(sizeof(PartitionClauseInfo));

There's no need to palloc0() here. You're setting all the fields to
zero just below. As far as I understand it, it's only Node types that
we have to go through the rigmarole of doing both.

18. I tentatively agree with you having changed the continue to break
in the following:

/* We can't use any volatile value to prune partitions. */
if (contain_volatile_functions((Node *) valueexpr))
break;

I believe it's not wrong to break here, but keep in mind you're
testing valueexpr rather than something with the OpExpr itself. The
reason I'm not saying this is wrong is that if the valueexpr is
volatile then it cannot possibly match another partition key anyway,
so there's likely no point in continuing to look for another match...
You should likely write a comment to explain this a bit. I think all
of the other places you've changed to break look fine. The
ScalarArrayOpExpr volatile function test is fine to break from since
the operands cannot be reversed in that case, so rightop certainly
can't match a partition key.

19. I might have caused this, but there's no such variable as 'cur'

/* cur is more restrictive, so replace the existing. */

20. Is there a difference between
partition_bound_has_default(context->boundinfo) and
context->has_default_part? Any reason for both? Your code uses both. I
don't yet understand why has_default_part was added.

That's all I have for now.

It's getting close. Good work!

I keep having to turn up my strictness level each time I review.  I'd
like to be able to turn it up all the way earlier, but I fear I may
still be reviewing v1 if I'd done that :-)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pgsql-hackers by date:

Previous
From: "Daniel Verite"
Date:
Subject: Re: NEXT VALUE FOR sequence
Next
From: Peter Eisentraut
Date:
Subject: SHA-2 functions