Re: [HACKERS] [POC] hash partitioning - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: [HACKERS] [POC] hash partitioning
Date
Msg-id CAMp0ubeo3fzzEfiE1vmc1AJkkRPxLnZQoOASeu6cCcco-c+9zw@mail.gmail.com
Whole thread Raw
In response to [HACKERS] [POC] hash partitioning  (Yugo Nagata <nagata@sraoss.co.jp>)
Responses Re: [HACKERS] [POC] hash partitioning
List pgsql-hackers
On Tue, Feb 28, 2017 at 6:33 AM, Yugo Nagata <nagata@sraoss.co.jp> wrote:
> In this patch, user can specify a hash function USING. However,
> we migth need default hash functions which are useful and
> proper for hash partitioning.

I suggest that we consider the hash functions more carefully. This is
(effectively) an on-disk format so it can't be changed easily later.

1. Consider a partition-wise join of two hash-partitioned tables. If
that's a hash join, and we just use the hash opclass, we immediately
lose some useful bits of the hash function. Same for hash aggregation
where the grouping key is the partition key.

To fix this, I think we need to include a salt in the hash API. Each
level of hashing can choose a random salt.

2. Consider a partition-wise join where the join keys are varchar(10)
and char(10). We can't do that join if we just use the existing hash
strategy, because 'foo' = 'foo       ' should match, but those values
have different hashes when using the standard hash opclass.

To fix this, we need to be smarter about normalizing values at a
logical level before hashing. We can take this to varying degrees,
perhaps even normalizing an integer to a numeric before hashing so
that you can do a cross-type join on int=numeric.

Furthermore, we need catalog metadata to indicate which hash functions
are suitable for which cross-type comparisons. Or, to put it the other
way, which typecasts preserve the partitioning.

3. We might want to use a hash function that is a little slower that
is more resistant to collisions. We may even want to use a 64-bit
hash.


My opinion is that we should work on this hashing infrastructure
first, and then support the DDL. If we get the hash functions right,
that frees us up to create better plans, with better push-downs, which
will be good for parallel query.

Regards,    Jeff Davis



pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: [HACKERS] Transition tables for triggers on foreign tables and views
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Bug in prepared statement cache invalidation?