Thread: [HACKERS] Hash Functions

[HACKERS] Hash Functions

From
Jeff Davis
Date:
https://www.postgresql.org/message-id/CAMp0ubeo3fzzEfiE1vmc1AJkkRPxLnZQoOASeu6cCcco-c+9zw@mail.gmail.com

In that thread, I pointed out some important considerations for the
hash functions themselves. This is a follow-up, after I looked more
carefully.

1. The hash functions as they exist today aren't portable -- they can
return different results on different machines. That means using these
functions for hash partitioning would yield different contents for the
same partition on different architectures (and that's bad, considering
they are logical partitions and not some internal detail).

The core hashing algorithm is based on 32-bit chunks, so it's only
portable if the input is an int32 (or array of int32s). That's not
true for varchar (etc.) or numeric (which is based on an array of
int16s). There is a hack for int8, but see issue #2 below.

We could try to mark portable vs. unportable hash functions, but it
seems quite valuable to be able to logically partition on varchar, so
I think we should have some portable answer there. Another annoyance
is that we would have to assume container types are unportable, or
make them inherit the portability of the underlying type's hash
function.

We could revert 26043592 (copied Tom because that was his patch) to
make hash_any() go back to being portable -- do we know what that
speedup actually was? Maybe the benefit is smaller on newer
processors? Another option is to try to do some combination of
byteswapping and word-at-a-time, which might be better than
byte-at-a-time if the byteswapping is done with a native instruction.
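
To sketch the byteswapping idea, here is a hypothetical example (not the
actual hash_any() code): the input is consumed a 32-bit word at a time,
but each word is normalized to little-endian before mixing, so both byte
orders see the same stream of words. The mixer is a toy placeholder, not
the real mix()/final() stages.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: no mixed-endian platforms; detect byte order at runtime. */
static int
is_big_endian(void)
{
	uint32_t	one = 1;

	return *(const unsigned char *) &one == 0;
}

static uint32_t
bswap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00) |
		((x << 8) & 0x00ff0000) | (x << 24);
}

/*
 * Word-at-a-time hash that is byte-order independent: each 32-bit chunk
 * is byteswapped on big-endian hosts so the mixer always sees the
 * little-endian interpretation.  The mixing step is a toy stand-in.
 */
static uint32_t
portable_hash_words(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t	h = seed;

	while (len >= 4)
	{
		uint32_t	w;

		memcpy(&w, p, 4);		/* avoid unaligned loads */
		if (is_big_endian())
			w = bswap32(w);		/* a native instruction on most CPUs */
		h = (h ^ w) * 0x9e3779b1;	/* toy mixer, not Jenkins' mix() */
		p += 4;
		len -= 4;
	}
	while (len-- > 0)			/* trailing bytes are already order-independent */
		h = (h ^ *p++) * 0x9e3779b1;
	return h;
}

int
main(void)
{
	const char *s = "example";

	printf("%u\n", (unsigned) portable_hash_words(s, strlen(s), 0));
	return 0;
}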

2. hashint8() is a little dubious. If the input is positive, it XORs
the high bits; if negative, it XORs the complement of the high bits.
That works for compatibility with hashint2/4, but it does not give
good hash properties[1]. I would prefer that we either (a) upcast int2/4 to
int8 before hashing and then hash the whole 64 bits; or (b) if the
value is within range, downcast to int4, otherwise hash the whole 64
bits.
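
As a rough sketch of option (b), assuming some portable 32-bit hash as
the base (the names below are made up for illustration, not the existing
hashint2/4/8 functions):

#include <stdint.h>
#include <stdio.h>

/* Murmur3-style 32-bit finalizer, standing in for a portable int4 hash. */
static uint32_t
hash_uint32_example(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}

/*
 * Option (b): a value that fits in int4 hashes exactly like an int4
 * (keeping int2/4/8 in one opfamily); an out-of-range value mixes both
 * 32-bit halves rather than XORing the high bits into the low bits.
 */
static uint32_t
hash_int8_example(int64_t v)
{
	uint32_t	lo;
	uint32_t	hi;

	if (v >= INT32_MIN && v <= INT32_MAX)
		return hash_uint32_example((uint32_t) (int32_t) v);

	lo = (uint32_t) v;
	hi = (uint32_t) ((uint64_t) v >> 32);
	return hash_uint32_example(lo) ^ hash_uint32_example(hi + 0x9e3779b9);
}

int
main(void)
{
	printf("%u %u\n",
		   (unsigned) hash_int8_example(42),
		   (unsigned) hash_int8_example((int64_t) 1 << 40));
	return 0;
}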

3. We should downcast numeric to int4/8 before hashing if it's in
range, so that it can be a part of the same opfamily as int2/4/8.

4. text/varchar should use hashbpchar() so that they can be part of
the same opfamily. Trailing blanks seem unlikely to be significant for
most real-world data anyway.

5. For salts[2], I don't think it's too hard to support them in an
optional way. We just allow the function to be a two-argument function
with a default. Places that care about specifying the salt may do so
if the function has pronargs==2, otherwise it just gets the default
value. If we have salts, I don't think having 64-bit hashes is very
important. If we run out of bits, we can just salt the hash function
differently and get more hash bits. This is not urgent and I believe
we should just implement salts when and if some algorithm needs them.
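
For what it's worth, a salt in this "affect the initial state" sense
could look roughly like the following at the C level; the optional second
SQL argument would simply be passed down as the seed. This is purely
illustrative (FNV-style mixing as a placeholder, made-up names), with
seed = 0 keeping the unseeded behavior:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical seeded hash: the seed only perturbs the starting state,
 * so different seeds give independent hash families (useful if 32 bits
 * ever run short), while seed == 0 keeps the original behavior.
 */
static uint32_t
seeded_hash_example(const void *data, size_t len, uint32_t seed)
{
	const unsigned char *p = data;
	uint32_t	h = 2166136261u ^ seed;	/* FNV-1a offset basis, salted */
	size_t		i;

	for (i = 0; i < len; i++)
	{
		h ^= p[i];
		h *= 16777619u;			/* FNV prime */
	}
	return h;
}

int
main(void)
{
	const char *s = "example";

	/* two seeds give two independent 32-bit hashes, i.e. 64 usable bits */
	printf("%u %u\n",
		   (unsigned) seeded_hash_example(s, strlen(s), 0),
		   (unsigned) seeded_hash_example(s, strlen(s), 1));
	return 0;
}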

Regards,   Jeff Davis

[1] You can see a kind of mirroring in the hash outputs, indicating bad mixing:

postgres=# select hashint8((2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((-2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

postgres=# select hashint8((-2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)


[2] Not sure I'm using the term "salt" properly. I really just mean a
way to affect the initial state, which I think is good enough for our
purposes.



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 12:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> 1. The hash functions as they exist today aren't portable -- they can
> return different results on different machines. That means using these
> functions for hash partitioning would yield different contents for the
> same partition on different architectures (and that's bad, considering
> they are logical partitions and not some internal detail).

Hmm, yeah, that is bad.

> We could revert 26043592 (copied Tom because that was his patch) to
> make hash_any() go back to being portable -- do we know what that
> speedup actually was? Maybe the benefit is smaller on newer
> processors? Another option is to try to do some combination of
> byteswapping and word-at-a-time, which might be better than
> byte-at-a-time if the byteswapping is done with a native instruction.

With regard to portability, I find that in 2009, according to Tom, we
had "already agreed" that it was dispensible:

http://postgr.es/m/23832.1234214526@sss.pgh.pa.us

I was not able to find where that was agreed.  On performance, I found this:

https://www.postgresql.org/message-id/20081104202655.GP18362@it.is.rice.edu

It says at the end: "The average time to reindex the table using our
current hash_any() without the separate mix()/final() was 1696ms and
1482ms with the separate mix()/final() stages giving almost 13% better
performance for this stupid metric."

> 5. For salts[2], I don't think it's too hard to support them in an
> optional way. We just allow the function to be a two-argument function
> with a default. Places that care about specifying the salt may do so
> if the function has pronargs==2, otherwise it just gets the default
> value. If we have salts, I don't think having 64-bit hashes is very
> important. If we run out of bits, we can just salt the hash function
> differently and get more hash bits. This is not urgent and I believe
> we should just implement salts when and if some algorithm needs them.

The potential problem with that is that the extra argument might slow
down the hash functions enough to matter.  Argument unpacking is not
free, and the hashing logic itself will get more complex.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:

On May 12, 2017 10:05:56 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Fri, May 12, 2017 at 12:08 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> 1. The hash functions as they exist today aren't portable -- they can
>> return different results on different machines. That means using these
>> functions for hash partitioning would yield different contents for the
>> same partition on different architectures (and that's bad, considering
>> they are logical partitions and not some internal detail).
>
>Hmm, yeah, that is bad.

Given that a lot of data types have an architecture-dependent
representation, it seems somewhat unrealistic and expensive to have a hard
rule to keep them architecture agnostic.  And if that's not guaranteed,
then I'm doubtful it makes sense as a soft rule either.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 1:12 PM, Andres Freund <andres@anarazel.de> wrote:
> Given that a lot of data types have an architecture-dependent
> representation, it seems somewhat unrealistic and expensive to have a hard
> rule to keep them architecture agnostic.  And if that's not guaranteed,
> then I'm doubtful it makes sense as a soft rule either.

That's a good point, but the flip side is that, if we don't have such
a rule, a pg_dump of a hash-partitioned table on one architecture
might fail to restore on another architecture.  Today, I believe that,
while the actual database cluster is architecture-dependent, a pg_dump
is architecture-independent.  Is it OK to lose that property?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 12, 2017 at 1:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Given that a lot of data types have an architecture-dependent
>> representation, it seems somewhat unrealistic and expensive to have a hard
>> rule to keep them architecture agnostic.  And if that's not guaranteed,
>> then I'm doubtful it makes sense as a soft rule either.

> That's a good point, but the flip side is that, if we don't have such
> a rule, a pg_dump of a hash-partitioned table on one architecture
> might fail to restore on another architecture.  Today, I believe that,
> while the actual database cluster is architecture-dependent, a pg_dump
> is architecture-independent.  Is it OK to lose that property?

I'd vote that it's not, which means that this whole approach to hash
partitioning is unworkable.  I agree with Andres that demanding hash
functions produce architecture-independent values will not fly.

Maintaining such a property for float8 (and the types that depend on it)
might be possible if you believe that nobody ever uses anything but IEEE
floats, but we've never allowed that as a hard assumption before.

Even architecture dependence isn't the whole scope of the problem.
Consider for example dumping a LATIN1-encoded database and trying
to reload it into a UTF8-encoded database.  People will certainly
expect that to be possible, and do you want to guarantee that the
hash of a text value is encoding-independent?
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Joe Conway
Date:
On 05/12/2017 10:17 AM, Robert Haas wrote:
> On Fri, May 12, 2017 at 1:12 PM, Andres Freund wrote:
>> Given that a lot of data types have an architecture-dependent
>> representation, it seems somewhat unrealistic and expensive to have
>> a hard rule to keep them architecture agnostic.   And if that's not
>> guaranteed, then I'm doubtful it makes sense as a soft rule
>> either.
>
> That's a good point, but the flip side is that, if we don't have
> such a rule, a pg_dump of a hash-partitioned table on one
> architecture might fail to restore on another architecture.  Today, I
> believe that, while the actual database cluster is
> architecture-dependent, a pg_dump is architecture-independent.  Is it
> OK to lose that property?


Not from where I sit.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd vote that it's not, which means that this whole approach to hash
> partitioning is unworkable.  I agree with Andres that demanding hash
> functions produce architecture-independent values will not fly.

If we can't produce architecture-independent hash values, then what's
the other option?

One alternative would be to change the way that we dump and restore
the data.  Instead of dumping the data with the individual partitions,
dump it all out for the parent and let tuple routing sort it out at
restore time.  Of course, this isn't very satisfying because now
dump-and-restore hasn't really preserved the state of the database;
indeed, you could make it into a hard failure by creating a foreign
key referencing an individual hash partition.  After dump-and-restore,
the row ends up in some other partition and the foreign key can't be
recreated because the relationship no longer holds.  This isn't
limited to foreign keys, either; similar problems could be created
with CHECK constraints or other per-table properties that can vary
between one child and another.

I basically think it's pretty futile to suppose that we can get away
with having a dump and restore move rows around between partitions
without having that blow up in some cases.

> Maintaining such a property for float8 (and the types that depend on it)
> might be possible if you believe that nobody ever uses anything but IEEE
> floats, but we've never allowed that as a hard assumption before.

I don't know how standard that is.  Is there any hardware that
anyone's likely to be using that doesn't?  TBH, I don't really care if
support for obscure, nearly-dead platforms like VAX or whatever doesn't
quite work with hash-partitioned tables.  In practice, PostgreSQL only
sorta works on that kind of platform anyway; there are far bigger
problems than this.  On the other hand, if there are servers being
shipped in 2017 that don't use IEEE floats, that's another problem.

What about integers?  I think we're already assuming two's-complement
arithmetic, which I think means that the only problem with making the
hash values portable for integers is big-endian vs. little-endian.
That sounds solvable-ish.
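
For illustration, the endianness problem only arises when a hash function
consumes the in-memory bytes of a value; a hash defined on the integer
value itself is automatically the same on big- and little-endian
machines, as in this toy sketch (not PostgreSQL code):

#include <stdint.h>
#include <stdio.h>

/*
 * Murmur3-style finalizer: pure arithmetic on the value, so the host's
 * byte order never enters into the result.
 */
static uint32_t
mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}

int
main(void)
{
	int32_t		v = 20170512;

	/* identical output on Intel and POWER: no memory bytes are inspected */
	printf("%u\n", (unsigned) mix32((uint32_t) v));
	return 0;
}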

> Even architecture dependence isn't the whole scope of the problem.
> Consider for example dumping a LATIN1-encoded database and trying
> to reload it into a UTF8-encoded database.  People will certainly
> expect that to be possible, and do you want to guarantee that the
> hash of a text value is encoding-independent?

No, I think that's expecting too much.  I'd be just fine telling
people that if you hash-partition on a text column, it may not load
into a database with another encoding.  If you care about that, don't
use hash-partitioning, or don't change the encoding, or dump out the
partitions one by one and reload all the rows into the parent table
for re-routing, solving whatever problems come up along the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 12, 2017 at 1:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd vote that it's not, which means that this whole approach to hash
>> partitioning is unworkable.  I agree with Andres that demanding hash
>> functions produce architecture-independent values will not fly.

> If we can't produce architecture-independent hash values, then what's
> the other option?

Forget hash partitioning.  There's no law saying that that's a good
idea and we have to have it.  With a different set of constraints,
maybe we could do it, but I think the existing design decisions have
basically locked it out --- and I doubt that hash partitioning is so
valuable that we should jettison other desirable properties to get it.

> One alternative would be to change the way that we dump and restore
> the data.  Instead of dumping the data with the individual partitions,
> dump it all out for the parent and let tuple routing sort it out at
> restore time.  Of course, this isn't very satisfying because now
> dump-and-restore hasn't really preserved the state of the database;
> indeed, you could make it into a hard failure by creating a foreign
> key referencing an individual hash partition.

Yeah, that isn't really appetizing at all.  If we were doing physical
partitioning below the user-visible level, we could make it fly.
But the existing design makes the partition boundaries user-visible
which means we have to insist that the partitioning rule is immutable
(in the strongest sense of being invariant across all installations
you could possibly wish to transport data between).

Now, we already have some issues of that sort with range partitioning;
if say you try to transport data to another system with a different
sort ordering, you have probably got problems.  But that issue already
existed with the old manual partitioning approach, ie if you have a
CHECK constraint like "(col >= 'foo' && col < 'gob')" then a transfer
to such a system could fail already.  So I'm okay with that.  But it
seems like hash partitioning is going to introduce new, exciting, and
hard-to-document-or-avoid portability gotchas.  Do we really need
to go there?
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Kenneth Marshall
Date:
On Fri, May 12, 2017 at 02:23:14PM -0400, Robert Haas wrote:
> 
> What about integers?  I think we're already assuming two's-complement
> arithmetic, which I think means that the only problem with making the
> hash values portable for integers is big-endian vs. little-endian.
> That sounds solvable-ish.
> 

xxhash produces identical hashes on both big-endian and little-endian
platforms.

https://github.com/Cyan4973/xxHash

Regards,
Ken



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, that isn't really appetizing at all.  If we were doing physical
> partitioning below the user-visible level, we could make it fly.
> But the existing design makes the partition boundaries user-visible
> which means we have to insist that the partitioning rule is immutable
> (in the strongest sense of being invariant across all installations
> you could possibly wish to transport data between).

I think you're right.

> Now, we already have some issues of that sort with range partitioning;
> if say you try to transport data to another system with a different
> sort ordering, you have probably got problems.  But that issue already
> existed with the old manual partitioning approach, ie if you have a
> CHECK constraint like "(col >= 'foo' && col < 'gob')" then a transfer
> to such a system could fail already.  So I'm okay with that.  But it
> seems like hash partitioning is going to introduce new, exciting, and
> hard-to-document-or-avoid portability gotchas.  Do we really need
> to go there?

That is a good question.  I think it basically amounts to this
question: is hash partitioning useful, and if so, for what?

Range and list partitioning inherently provide management benefits.
For example, if you range-partition your data by year, then when you
want to get rid of the oldest year worth of data, you can drop the
entire partition at once, which is likely to be much faster than a
DELETE statement.  But hash partitioning provides no such benefit
because, by design, the distribution of the data among the partitions
is essentially random.  Dropping one partition will not usually be a
useful thing to do because the rows in that partition are logically
unconnected.  So, if you have a natural way of partitioning a table by
range or list, that's probably going to be superior to hash
partitioning.  But sometimes there isn't an obviously useful way of
breaking up the data, yet you still want to partition it in order to
reduce the size of the individual tables.

That, in turn, allows maintenance operations to be performed on each
partition separately.  For example, suppose your table is big enough
that running CLUSTER or VACUUM FULL on it takes eight hours, but you
can't really afford an eight-hour maintenance window when the table
gets bloated.  However, you can afford (on Sunday nights when activity
is lowest) a window of, say, one hour.  Well, if you hash-partition
the table, you can CLUSTER or VACUUM FULL one partition a week for N
weeks until you get to them all.  Similarly, if you need to create an
index, you can build it on one partition at a time.  You can even add
the index to one partition to see how well it works -- for example,
does it turn too many HOT updates into non-HOT updates, causing bloat?
-- and try it out before you go do it across the board.  And an
unpartitioned table can only accommodate one VACUUM process at a time,
but a partitioned table can have one per partition.  Partitioning a
table also means that you don't need a single disk volume large enough
to hold the whole thing; instead, you can put each partition in a
separate tablespace or (in some exciting future world where PostgreSQL
looks more like a distributed system) perhaps even on another server.

For a table that is a few tens of gigabytes, none of this amounts to a
hill of beans, but for a table that is a few tens of terabytes, the
time it takes to perform administrative operations can become a really
big problem.  A good example is, say, a table full of orders.  Imagine
a high-velocity business like Amazon or UPS that has a constant stream
of new orders, or a mobile app that has a constant stream of new user
profiles.  If that data grows fast enough that the table needs to be
partitioned, how should it be done?  It's often desirable to create
partitions of about equal size and about equal hotness, and
range-partitioning on the creation date or order ID number means
continually creating new partitions, and having all of the hot data in
the same partition.

In my experience, it is *definitely* the case that users with very
large tables are experiencing significant pain right now.  The freeze
map changes were a good step towards ameliorating some of that pain,
but I think hash partitioning is another step in that direction, and I
think there will be other applications as well.  Even for users who
don't have data that large, there are also interesting
query-performance implications of hash partitioning.  Joins between
very large tables tend to get implemented as merge joins, which often
means sorting all the data, which is O(n lg n) and not easy to
parallelize.  But if those very large tables happened to be compatibly
partitioned on the join key, you could do a partitionwise join per the
patch Ashutosh proposed, which would be cheaper and easier to do in
parallel.

Maybe a shorter argument for hash partitioning is that not one but two
different people proposed patches for it within months of the initial
partitioning patch going in.  When multiple people are thinking about
implementing the same feature almost immediately after the
prerequisite patches land, that's a good clue that it's a desirable
feature.  So I think we should try to solve the problems, rather than
giving up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Peter Eisentraut
Date:
On 5/12/17 14:23, Robert Haas wrote:
> One alternative would be to change the way that we dump and restore
> the data.  Instead of dumping the data with the individual partitions,
> dump it all out for the parent and let tuple routing sort it out at
> restore time.

I think this could be a pg_dump option.  One way it dumps out the
partitions, and another way it recomputes the partitions.  I think that
could be well within pg_dump's mandate.

(cough ... logical replication ... cough)

> Of course, this isn't very satisfying because now
> dump-and-restore hasn't really preserved the state of the database;

That depends on whether you consider the partitions to be user-visible
or an internal detail.  The current approach is sort of in the middle,
so it makes sense to allow the user to interpret it either way depending
on need.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Hash Functions

From
Alvaro Herrera
Date:
Peter Eisentraut wrote:
> On 5/12/17 14:23, Robert Haas wrote:
> > One alternative would be to change the way that we dump and restore
> > the data.  Instead of dumping the data with the individual partitions,
> > dump it all out for the parent and let tuple routing sort it out at
> > restore time.
> 
> I think this could be a pg_dump option.  One way it dumps out the
> partitions, and another way it recomputes the partitions.  I think that
> could be well within pg_dump's mandate.

I was thinking the same, but enable that option automatically for hash
partitioning.  Each partition is still dumped separately, but instead of
referencing the specific partition in the TABLE DATA object, reference
the parent table.  This way, the restore can still be parallelized, but
tuple routing is executed for each tuple.

> (cough ... logical replication ... cough)

I think for logical replication the tuple should appear as being in the
parent table, not the partition.  No?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Hash Functions

From
Peter Eisentraut
Date:
On 5/12/17 18:13, Alvaro Herrera wrote:
> I think for logical replication the tuple should appear as being in the
> parent table, not the partition.  No?

Logical replication replicates base table to base table.  How those
tables are tied together into a partitioned table or an inheritance tree
is up to the system catalogs on each side.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Hash Functions

From
David Fetter
Date:
On Fri, May 12, 2017 at 06:38:55PM -0400, Peter Eisentraut wrote:
> On 5/12/17 18:13, Alvaro Herrera wrote:
> > I think for logical replication the tuple should appear as being in the
> > parent table, not the partition.  No?
> 
> Logical replication replicates base table to base table.  How those
> tables are tied together into a partitioned table or an inheritance tree
> is up to the system catalogs on each side.

This seems like a totally reasonable approach to pg_dump, especially
in light of the fact that logical replication already (and quite
reasonably) does it this way.  Hard work has been done to make
tuple-routing cheap, and this is one of the payoffs.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 7:36 PM, David Fetter <david@fetter.org> wrote:
> On Fri, May 12, 2017 at 06:38:55PM -0400, Peter Eisentraut wrote:
>> On 5/12/17 18:13, Alvaro Herrera wrote:
>> > I think for logical replication the tuple should appear as being in the
>> > parent table, not the partition.  No?
>>
>> Logical replication replicates base table to base table.  How those
>> tables are tied together into a partitioned table or an inheritance tree
>> is up to the system catalogs on each side.
>
> This seems like a totally reasonable approach to pg_dump, especially
> in light of the fact that logical replication already (and quite
> reasonably) does it this way.  Hard work has been done to make
> tuple-routing cheap, and this is one of the payoffs.

Cheap isn't free, though.  It's got a double-digit percentage overhead
rather than a large-multiple-of-the-runtime overhead as triggers do,
but people still won't want to pay it unnecessarily, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-05-12 21:56:30 -0400, Robert Haas wrote:
> Cheap isn't free, though.  It's got a double-digit percentage overhead
> rather than a large-multiple-of-the-runtime overhead as triggers do,
> but people still won't want to pay it unnecessarily, I think.

That should be partially addressable with reasonable amounts of
engineering though.  Computing the target partition in a hash-partitioned
table can be implemented very efficiently, and adding
infrastructure for multiple bulk insert targets in copy should be quite
doable as well. It's also work that's generally useful, not just for
backups.

The bigger issue to me here wrt pg_dump is that partitions can be
restored in parallel, but that'd probably not work as well if not dumped
separately.  Unless we'd do the re-routing on load, rather than when
dumping - which'd also increase cache locality, by most of the time
(same architecture/encoding/etc) having one backend insert into the same
partition.

Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Amit Kapila
Date:
On Sat, May 13, 2017 at 1:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 12, 2017 at 2:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Maybe a shorter argument for hash partitioning is that not one but two
> different people proposed patches for it within months of the initial
> partitioning patch going in.  When multiple people are thinking about
> implementing the same feature almost immediately after the
> prerequisite patches land, that's a good clue that it's a desirable
> feature.  So I think we should try to solve the problems, rather than
> giving up.
>

Can we think of defining separate portable hash functions which can be
used for the purpose of hash partitioning?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can we think of defining separate portable hash functions which can be
> used for the purpose of hash partitioning?

I think that would be a good idea.  I think it shouldn't even be that
hard.  By data type:

- Integers.  We'd need to make sure that we get the same results for
the same value on big-endian and little-endian hardware, and that
performance is good on both systems.  That seems doable.

- Floats.  There may be different representations in use on different
hardware, which could be a problem.  Tom didn't answer my question
about whether any even-vaguely-modern hardware is still using non-IEEE
floats, which I suspect means that the answer is "no".  If every bit
of hardware we are likely to find uses basically the same
representation of the same float value, then this shouldn't be hard.
(Also, even if this turns out to be hard for floats, using a float as
a partitioning key would be a surprising choice because the default
output representation isn't even unambiguous; you need
extra_float_digits for that.)

- Strings.  There's basically only one representation for a string.
If we assume that the hash code only needs to be portable across
hardware and not across encodings, a position for which I already
argued upthread, then I think this should be manageable.

- Everything Else.  Basically, everything else is just a composite of
that stuff, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Fri, May 12, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maintaining such a property for float8 (and the types that depend on it)
> might be possible if you believe that nobody ever uses anything but IEEE
> floats, but we've never allowed that as a hard assumption before.

This is not such a big practical problem (for me at least) because
hashing of floats is of dubious value.

> Even architecture dependence isn't the whole scope of the problem.
> Consider for example dumping a LATIN1-encoded database and trying
> to reload it into a UTF8-encoded database.  People will certainly
> expect that to be possible, and do you want to guarantee that the
> hash of a text value is encoding-independent?

That is a major problem. In an ideal world, we could make that work
with something like ucol_getSortKey(), but we don't require ICU, and
we can't mix getSortKey() with strxfrm(), or even strxfrm() results
from different platforms.

I don't think it's correct to hash the code points, either, because
strings may be considered equal in a locale even if the code points
aren't identical. But I don't think postgres lives up to that standard
currently.

But hash partitioning is too valuable to give up on entirely. I think
we should consider supporting a limited subset of types for now with
something not based on the hash am.

Regards,   Jeff Davis



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Can we think of defining separate portable hash functions which can be
>> used for the purpose of hash partitioning?

> I think that would be a good idea.  I think it shouldn't even be that
> hard.  By data type:

> - Integers.  We'd need to make sure that we get the same results for
> the same value on big-endian and little-endian hardware, and that
> performance is good on both systems.  That seems doable.

> - Floats.  There may be different representations in use on different
> hardware, which could be a problem.  Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no".  If every bit
> of hardware we are likely to find uses basically the same
> representation of the same float value, then this shouldn't be hard.
> (Also, even if this turns out to be hard for floats, using a float as
> a partitioning key would be a surprising choice because the default
> output representation isn't even unambiguous; you need
> extra_float_digits for that.)

> - Strings.  There's basically only one representation for a string.
> If we assume that the hash code only needs to be portable across
> hardware and not across encodings, a position for which I already
> argued upthread, then I think this should be manageable.

Basically, this is simply saying that you're willing to ignore the
hard cases, which reduces the problem to one of documenting the
portability limitations.  You might as well not even bother with
worrying about the integer case, because porting between little-
and big-endian systems is surely far less common than cases you've
already said you're okay with blowing off.

That's not an unreasonable position to take, perhaps; doing better
than that is going to be a lot more work and it's not very clear
how much real-world benefit results.  But I can't follow the point
of worrying about endianness but not encoding.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Fri, May 12, 2017 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Forget hash partitioning.  There's no law saying that that's a good
> idea and we have to have it.  With a different set of constraints,
> maybe we could do it, but I think the existing design decisions have
> basically locked it out --- and I doubt that hash partitioning is so
> valuable that we should jettison other desirable properties to get it.

A lot of the optimizations that can make use of hash partitioning
could also make use of range partitioning. But let me defend hash
partitioning:

* hash partitioning requires fewer decisions by the user
* naturally balances data and workload among partitions in most cases
* easy to match with a degree of parallelism

But with range partitioning, you can have situations where different
tables have different distributions of data. If you partition to
balance the data between partitions in both cases, then that makes
partition-wise join a lot harder because the boundaries don't line up.
If you make the boundaries line up to do partition-wise join, the
partitions might have wildly different amounts of data in them. Either
way, it makes parallelism harder.

Even without considering joins, range partitioning could force you to
make a choice between balancing the data and balancing the workload.
If you are partitioning based on date, then a lot of the workload will
be on more recent partitions. That's desirable sometimes (e.g. for
vacuum) but not always desirable for parallelism.

Hash partitioning doesn't have these issues and goes very nicely with
parallel query.

Regards,   Jeff Davis



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Fri, May 12, 2017 at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That is a good question.  I think it basically amounts to this
> question: is hash partitioning useful, and if so, for what?

Two words: parallel query. To get parallelism, one of the best
approaches is dividing the data, then doing as much work as possible
before combining it again. If you have hash partitions on some key,
then you can do partition-wise joins or partition-wise aggregation on
that key in parallel with no synchronization/communication overhead
(until the final result).

You've taken postgres pretty far in this direction already; hash
partitioning will take it one step further by allowing more pushdowns
and lower sync/communication costs.

Some of these things can be done with range partitioning, too, but see
my other message here:

https://www.postgresql.org/message-id/CAMp0ubfNMSGRvZh7N7TRzHHN5tz0ZeFP13Aq3sv6b0H37fdcPg%40mail.gmail.com

Regards,   Jeff Davis



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-05-13 10:29:09 -0400, Robert Haas wrote:
> On Sat, May 13, 2017 at 12:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Can we think of defining separate portable hash functions which can be
> > used for the purpose of hash partitioning?
> 
> I think that would be a good idea.  I think it shouldn't even be that
> hard.  By data type:
> 
> - Integers.  We'd need to make sure that we get the same results for
> the same value on big-endian and little-endian hardware, and that
> performance is good on both systems.  That seems doable.
> 
> - Floats.  There may be different representations in use on different
> hardware, which could be a problem.  Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no".  If every bit
> of hardware we are likely to find uses basically the same
> representation of the same float value, then this shouldn't be hard.
> (Also, even if this turns out to be hard for floats, using a float as
> a partitioning key would be a surprising choice because the default
> output representation isn't even unambiguous; you need
> extra_float_digits for that.)
> 
> - Strings.  There's basically only one representation for a string.
> If we assume that the hash code only needs to be portable across
> hardware and not across encodings, a position for which I already
> argued upthread, then I think this should be manageable.
> 
> - Everything Else.  Basically, everything else is just a composite of
> that stuff, I think.

I seriously doubt that's true.  A lot of more complex types have
internal alignment padding and such.  Consider e.g. something like
jsonb, hstore, or postgis types - you *can* convert them to something
that's unambiguous, but it's going to be fairly expensive.  Essentially
you'd have to something like calling the output function, and then
hashing the result of that.  And hash-partitioning is particularly
interesting for e.g. large amounts of points in a geospatial scenario,
because other types of partitioning are quite hard to maintain.

- Andres



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sat, May 13, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
> I seriously doubt that's true.  A lot of more complex types have
> internal alignment padding and such.

True, but I believe we require those padding bytes to be zero.  If we
didn't, then hstore_hash would be broken already.

> Consider e.g. something like
> jsonb, hstore, or postgis types - you *can* convert them to something
> that's unambiguous, but it's going to be fairly expensive.

I'm fuzzy on what you think we'd need to do.

> Essentially
> you'd have to do something like calling the output function, and then
> hashing the result of that.

I really don't see why we'd have to go to nearly that length.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:

On May 13, 2017 8:44:22 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>On Sat, May 13, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de>
>wrote:
>> I seriously doubt that's true.  A lot of more complex types have
>> internal alignment padding and such.
>
>True, but I believe we require those padding bytes to be zero.  If we
>didn't, then hstore_hash would be broken already.

It'll be differently sized on different platforms.  So everyone will have
to write hash functions that look at each member individually, rather than
hashing the entire struct at once.  And for each member you'll have to use
a type-specific hash function...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sat, May 13, 2017 at 1:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Basically, this is simply saying that you're willing to ignore the
> hard cases, which reduces the problem to one of documenting the
> portability limitations.  You might as well not even bother with
> worrying about the integer case, because porting between little-
> and big-endian systems is surely far less common than cases you've
> already said you're okay with blowing off.
>
> That's not an unreasonable position to take, perhaps; doing better
> than that is going to be a lot more work and it's not very clear
> how much real-world benefit results.  But I can't follow the point
> of worrying about endianness but not encoding.

Encoding is a user choice, not a property of the machine.  Or, looking
at it from another point of view, the set of values that can be
represented by an int4 is the same whether they are represented in
big-endian form or in little-endian form, but the set of values that
are representable changes when you switch encodings.  You could argue
that text-under-LATIN1 and text-under-UTF8 aren't really the same data
type at all.  It's one thing to say "you can pick up your data and
move it to a different piece of hardware and nothing will break".
It's quite another thing to say "you can pick up your data and convert
it to a different encoding and nothing will break".  The latter is
generally false already.  Maybe LATIN1 -> UTF8 is no-fail, but what
about UTF8 -> LATIN1 or SJIS -> anything?  Based on previous mailing
list discussions, I'm under the impression that it is sometimes
debatable how a character in one encoding should be converted to some
other encoding, either because it's not clear whether there is a
mapping at all or it's unclear what mapping should be used.  See,
e.g., 2dbbf33f4a95cdcce66365bcdb47c885a8858d3c, or
https://www.postgresql.org/message-id/1739a900-30ab-f48e-aec4-2b35475ecf02%402ndquadrant.com
where it was discussed that being able to convert encoding A ->
encoding B does not guarantee the ability to perform the reverse
conversion.

Arguing that a given int4 value should hash to the same value on every
platform seems like a request that is at least superficially
reasonable, if possibly practically tricky in some cases.  Arguing
that every currently supported encoding should hash every character
the same way when they don't all have the same set of characters and
the mappings between them are occasionally debatable is asking for the
impossible.  I certainly don't want to commit to a design for hash
partitioning that involves a compatibility break any time somebody
changes any encoding conversion in the system, even if a hash function
that involved translating every character to some sort of universal
code point before hashing it didn't seem likely to be horribly slow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sat, May 13, 2017 at 11:47 PM, Andres Freund <andres@anarazel.de> wrote:
> It'll be differently sized on different platforms.  So everyone will have
> to write hash functions that look at each member individually, rather than
> hashing the entire struct at once.  And for each member you'll have to use
> a type-specific hash function...

Well, that's possibly kind of annoying, but it still sounds like
pretty mechanical work.
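
For instance, a field-by-field hash with a combiner could look roughly
like the hypothetical sketch below; padding bytes, struct layout, and
byte order never participate, so the result is as portable as the
per-field hashes are.  None of these names are existing backend
functions.

#include <stdint.h>
#include <stdio.h>

typedef struct Example
{
	int16_t		a;			/* typically followed by 2 padding bytes */
	int32_t		b;
	int64_t		c;
} Example;

/* Murmur3-style finalizer, standing in for a portable per-field hash. */
static uint32_t
mix32(uint32_t h)
{
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}

/* Boost-style hash combiner; any order-sensitive combiner would do. */
static uint32_t
combine(uint32_t a, uint32_t b)
{
	return a ^ (b + 0x9e3779b9 + (a << 6) + (a >> 2));
}

/* Hash each member as a value, in a fixed order; no raw struct bytes. */
static uint32_t
hash_example_fields(const Example *e)
{
	uint32_t	h = 0;

	h = combine(h, mix32((uint32_t) (int32_t) e->a));
	h = combine(h, mix32((uint32_t) e->b));
	h = combine(h, mix32((uint32_t) ((uint64_t) e->c >> 32) ^ (uint32_t) e->c));
	return h;
}

int
main(void)
{
	Example		e = {1, 2, 3};

	printf("%u\n", (unsigned) hash_example_fields(&e));
	return 0;
}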

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Greg Stark
Date:
On 13 May 2017 at 10:29, Robert Haas <robertmhaas@gmail.com> wrote:
> - Floats.  There may be different representations in use on different
> hardware, which could be a problem.  Tom didn't answer my question
> about whether any even-vaguely-modern hardware is still using non-IEEE
> floats, which I suspect means that the answer is "no".

Fwiw the answer to that is certainly no. The only caveat is that some
platforms have not entirely complete implementations of IEEE, missing
corner cases such as denormalized values, but I don't think that is
something that would be affected by a different hash function.

Personally while I would like to avoid code that actively crashes or
fails basic tests on Vax even I don't think we need to worry about
replication or federated queries in a heterogeneous environment where
some servers are Vaxen and some are modern hardware. That seems a bit
far-fetched.

-- 
greg



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-05-14 15:59:09 -0400, Greg Stark wrote:
> Personally while I would like to avoid code that actively crashes or
> fails basic tests on Vax

I personally vote for simply refusing to run/compile on non-IEEE
platforms, including VAX.  The benefit of even trying to get that right,
not to speak of actually testing, seems just not there.

> even I don't think we need to worry about
> replication or federated queries in a heterogeneous environment where
> some servers are Vaxen and some are modern hardware. That seems a bit
> far-fetched.

Imo there's little point in trying to delineate some subset that we want
to support on platforms that are 20 years out of date.  Either we do, or
don't. And since there's no point aside from some intellectual
curiosity...

On the other hand, I don't believe my opinion (that requiring hashing to
be portable is unrealistic) is meaningfully affected by disallowing
non-IEEE floats.


Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Peter Geoghegan
Date:
On Sat, May 13, 2017 at 9:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The latter is
> generally false already.  Maybe LATIN1 -> UTF8 is no-fail, but what
> about UTF8 -> LATIN1 or SJIS -> anything?  Based on previous mailing
> list discussions, I'm under the impression that it is sometimes
> debatable how a character in one encoding should be converted to some
> other encoding, either because it's not clear whether there is a
> mapping at all or it's unclear what mapping should be used.

The express goal of the Unicode consortium is to replace all existing
encodings with Unicode. My personal opinion is that a Unicode
monoculture would be a good thing, provided reasonable differences can
be accommodated. So, it might be that there is ambiguity about how one
codepoint can be converted to another in another encoding, but that's
because encodings like SJIS and BIG5 are needlessly ambiguous. It has
nothing to do with cultural preferences leaving the question
undecidable (at least by a panel of natural language experts), and
everything to do with these regional character encoding systems being
objectively bad. They richly deserve to die, and are in fact dying.

Encoding actually *is* a property of the machine, even though regional
encodings obfuscate things. There is a reason why MacOS and Java use
UTF-16 rather than UTF-8, and there is a reason why the defacto
standard on the web is UTF-8, and those reasons are completely
technical. AFAICT, whatever non-technical reasons remain are actually
technical debt in disguise.

Where this leaves hash partitioning, I cannot say.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] Hash Functions

From
Thomas Munro
Date:
On Mon, May 15, 2017 at 7:59 AM, Greg Stark <stark@mit.edu> wrote:
> On 13 May 2017 at 10:29, Robert Haas <robertmhaas@gmail.com> wrote:
>> - Floats.  There may be different representations in use on different
>> hardware, which could be a problem.  Tom didn't answer my question
>> about whether any even-vaguely-modern hardware is still using non-IEEE
>> floats, which I suspect means that the answer is "no".
>
> Fwiw the answer to that is certainly no. The only caveat is that some
> platforms have not entirely complete implementations of IEEE missing
> corner cases such as denormalized values but I don't think that would
> be something that would be changed with a different hash function
> though.

Well... along with the Intel/IEEE-754 and VAX formats, there is a
third floating point format that is/was in widespread use: IBM hex
float[1].  It's base 16 rather than base 2, and might be the default
on some IBM operating systems[2] for the C float and double types (but
fortunately not xlc for AIX or Linux, and I have no clue about i/OS).
This is probably irrelevant because it looks like people aren't
running PostgreSQL on z/OS right now[3], but unlike VAXen these
systems are alive and well so I just wanted to respond to the
implication on this thread that the whole world is a VAX or an IEEE
system :-)  People really use this... I happen to know a shop that has
petabytes of IBM hex float data.

(IBM hardware also supports base 10 floats, but they show up as
different types in C so not relevant.)

[1] https://en.wikipedia.org/wiki/IBM_Floating_Point_Architecture
[2] https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbcux01/flotcop.htm#flotcop
[3] https://www.postgresql.org/message-id/flat/BLU437-SMTP4B3FF36035D8A3C3816D49C160%40phx.gbl

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-05-14 15:59:09 -0400, Greg Stark wrote:
>> Personally while I would like to avoid code that actively crashes or
>> fails basic tests on Vax

> I personally vote for simply refusing to run/compile on non-IEEE
> platforms, including VAX.

The point of wanting that is not because anybody thinks that VAX per
se is an interesting platform anymore.  The point is to avoid locking
ourselves into narrow assumptions that we may need to break someday.
Saying "nobody cares about floats other than IEEE floats" is morally
indistinguishable from saying "nobody cares about running on any
platform except Windows", which was a depressingly common opinion
back in the nineties or so.

It may well be that we can get away with saying "we're not going
to make it simple to move hash-partitioned tables with float
partition keys between architectures with different float
representations".  But there's a whole lot of daylight between that
and denying any support for float representations other than the
currently-most-popular one.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-05-14 18:25:08 -0400, Tom Lane wrote:
> It may well be that we can get away with saying "we're not going
> to make it simple to move hash-partitioned tables with float
> partition keys between architectures with different float
> representations".  But there's a whole lot of daylight between that
> and denying any support for float representations other than the
> currently-most-popular one.

Note that I, IIRC in the mail you responded to, also argued that I don't
think it'd be a good idea to rely on hash functions being portable.  The
amount of lock-in that'd create, especially for more complex datatypes,
seems wholly inadvisable.  I still think that dumping tables in a way
they're reloaded via the top-partition (probably one copy statement for
each child partition), and prohibiting incoming fkeys to partitions, is
a better approach to all this.

Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> The express goal of the Unicode consortium is to replace all existing
> encodings with Unicode. My personal opinion is that a Unicode
> monoculture would be a good thing, provided reasonable differences can
> be accommodated.

Can't help remembering Randall Munroe's take on such things:
https://xkcd.com/927/

I agree that the Far Eastern systems that can't easily be replaced
by Unicode are that way mostly because they're a mess.  But I'm
still of the opinion that locking ourselves into Unicode is a choice
we might regret, far down the road.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Thomas Munro
Date:
On Mon, May 15, 2017 at 10:08 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> [2] https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.cbcux01/flotcop.htm#flotcop

Though looking more closely I see that the default is IEEE in 64 bit
builds, which seems like a good way to kill the older format off.
If/when someone gets PostgreSQL ported to z/OS it probably won't be
because they want to run it on an ancient 32 bit mainframe.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Hash Functions

From
Peter Geoghegan
Date:
On Sun, May 14, 2017 at 3:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I agree that the Far Eastern systems that can't easily be replaced
> by Unicode are that way mostly because they're a mess.  But I'm
> still of the opinion that locking ourselves into Unicode is a choice
> we might regret, far down the road.

It's not a choice that has any obvious upside, so I have no reason to
disagree. My point was only that Robert's contention that "You could
argue that text-under-LATIN1 and text-under-UTF8 aren't really the
same data type at all" seems wrong to me, because PostgreSQL seems to
want to treat encoding as a property of the machine. This is evidenced
by the fact that we expect applications to change client encoding
"transparently". That is, client encoding may be changed without in
any way affecting humans that speak a natural language that is
provided for by the application's client encoding. That's a great
ideal to have, and one that is very close to completely workable.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sun, May 14, 2017 at 6:29 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-14 18:25:08 -0400, Tom Lane wrote:
>> It may well be that we can get away with saying "we're not going
>> to make it simple to move hash-partitioned tables with float
>> partition keys between architectures with different float
>> representations".  But there's a whole lot of daylight between that
>> and denying any support for float representations other than the
>> currently-most-popular one.
>
> Note that I, IIRC in the mail you responded to, also argued that I don't
> think it'd be a good idea to rely on hashfunctions being portable.  The
> amount of lock-in that'd create, especially for more complex datatypes,
> seems wholly inadvisable.  I still think that dumping tables in a way
> they're reloaded via the top-partition (probably one copy statement for
> each child partition), and prohibiting incoming fkeys to partitions, is
> a better approach to all this.

You'd have to prohibit a heck of a lot more than that in order for
this to work 100% reliably.  You'd have to prohibit CHECK constraints,
triggers, rules, RLS policies, and UNIQUE indexes, at the least.  You
might be able to convince me that some of those things are useless
when applied to partitions, but wanting a CHECK constraint that
applies to only one partition seems pretty reasonable (e.g. CHECK that
records for older years are all in the 'inactive' state, or whatever).
I think getting this to work 100% reliably in all cases would require
an impractically large hammer.

Now that's not to say that we shouldn't have a
reload-through-the-top-parent switch; actually, I think that's a good
idea.  I just don't believe that it can ever be a complete substitute
for portable hash functions.  I think almost everybody here agrees
that it isn't necessary to have hash functions that are 100% portable,
e.g. to VAX.  But it would be nice IMHO if you could use an integer
column as the partitioning key and have that be portable between Intel
and POWER.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
Hi,

On 2017-05-14 21:22:58 -0400, Robert Haas wrote:
> but wanting a CHECK constraint that applies to only one partition
> seems pretty reasonable (e.g. CHECK that records for older years are
> all in the 'inactive' state, or whatever).

On a hash-partitioned table?


> Now that's not to say that we shouldn't have a
> reload-through-the-top-parent switch; actually, I think that's a good
> idea.  I just don't believe that it can ever be a complete substitute
> for portable hash functions.  I think almost everybody here agrees
> that it isn't necessary to have hash functions that are 100% portable,
> e.g. to VAX.  But it would be nice IMHO if you could use an integer
> column as the partitioning key and have that be portable between Intel
> and POWER.

I'm not saying it can't work for any datatype, I just think it'd be a
very bad idea to make it work for any non-trivial ones. The likelihood
of regularly breaking or preventing us from improving things seems high.
I'm not sure that a design where this works most of the time, for some
datatypes, is a good one.

Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Bruce Momjian
Date:
On Sun, May 14, 2017 at 01:06:03PM -0700, Andres Freund wrote:
> On 2017-05-14 15:59:09 -0400, Greg Stark wrote:
> > Personally while I would like to avoid code that actively crashes or
> > fails basic tests on Vax
> 
> I personally vote for simply refusing to run/compile on non-IEEE
> platforms, including VAX.  The benefit of even trying to get that right,
> not to speak of actually testing, seems just not there.

Do we even know that floats are precise enough to determine the
partition?  For example, if you have 6.000000001, is it possible for
that to be 5.9999999 on some systems?  Are IEEE systems all the same for
these values?  I would say we should disallow any approximate data type
for partitioning completely.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Do we even know that floats are precise enough to determine the
> partition?  For example, if you have 6.000000001, is it possible for
> that to be 5.9999999 on some systems?  Are IEEE systems all the same for
> these values?  I would say we should disallow any approximate data type
> for partitioning completely.

I'm inclined in this direction, as well. Hash partitioning is mostly
useful for things that are likely to be join keys or group keys, and
floats aren't. Same for complex user-defined types.

The real problem here is what Tom pointed out: that we would have
trouble hashing strings in an encoding-insensitive way. Strings are
useful as join/group keys, so it would be painful to not support them.

Perhaps there's some kind of compromise, like a pg_dump mode that
reloads through the root. Or maybe hash partitions are really a
"semi-logical" partitioning that the optimizer understands, but where
things like per-partition check constraints don't make sense.

Regards,   Jeff Davis





Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> You'd have to prohibit a heck of a lot more than that in order for
> this to work 100% reliably.  You'd have to prohibit CHECK constraints,
> triggers, rules, RLS policies, and UNIQUE indexes, at the least.  You
> might be able to convince me that some of those things are useless
> when applied to partitions, but wanting a CHECK constraint that
> applies to only one partition seems pretty reasonable (e.g. CHECK that
> records for older years are all in the 'inactive' state, or whatever).
> I think getting this to work 100% reliably in all cases would require
> an impractically large hammer.

The more I think about it the more I think hash partitions are
"semi-logical". A check constraint on a single partition of a
range-partitioned table makes sense, but it doesn't make sense on a
single partition of a hash-partitioned table.

I think hash partitioning is mainly useful for parallel query (so the
optimizer needs to know about it) and maintenance commands (as you
listed in another email). But those don't need hash portability.

FKs are a little more interesting, but I actually think they still
work regardless of hash portability. If the two sides are in the same
hash opfamily, they should hash the same and it's fine. And if they
aren't, the FK has no hope of working regardless of hash portability.

This would mean we need to reload through the root as Andres and
others suggested, and disable a lot of logical partitioning
capabilities. I'd be a little worried about what we do with
attaching/detaching, though.

Regards,   Jeff Davis



Re: [HACKERS] Hash Functions

From
Bruce Momjian
Date:
On Mon, May 15, 2017 at 07:32:30AM -0700, Jeff Davis wrote:
> On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Do we even know that floats are precise enough to determine the
> > partition?  For example, if you have 6.000000001, is it possible for
> > that to be 5.9999999 on some systems?  Are IEEE systems all the same for
> > these values?  I would say we should disallow any approximate data type
> > for partitioning completely.
> 
> I'm inclined in this direction, as well. Hash partitioning is mostly
> useful for things that are likely to be join keys or group keys, and
> floats aren't. Same for complex user-defined types.
> 
> The real problem here is what Tom pointed out: that we would have
> trouble hashing strings in an encoding-insensitive way. Strings are
> useful as join/group keys, so it would be painful to not support them.

Well, since we can't mix encodings in the same column, why can't we just
hash the binary representation of the string?  My point is that with
hashing we aren't comparing one string with another, right?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] Hash Functions

From
David Fetter
Date:
On Mon, May 15, 2017 at 07:48:14AM -0700, Jeff Davis wrote:
> This would mean we need to reload through the root as Andres and
> others suggested,

One refinement of this would be to traverse the partition tree,
stopping at the first place where the next branch has hash partitions,
or at any rate types which have no application-level significance, and
load from there.  This could allow for parallelizing where it's
portable to do so.

Level                   Table    Partition Type
------------------------------------------------
Base table:             Log     (N/A)
Next partition:         Year    (range)
Next partition:         Month   (range)
Next partition:         Day     (range) <---- Data gets loaded no lower than here
Next partition:         *       (hash)

That last level, and any below it, doesn't have a specific name because
those partitions are just an implementation detail, i.e. none has any
application-level meaning.
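For concreteness, a rough sketch of such a hierarchy in DDL (all names
are made up, and the hash leaves use the modulus/remainder syntax from
the proposed hash-partitioning patch, so treat it as illustrative only):

    CREATE TABLE log (ts timestamptz, user_id bigint, payload text)
        PARTITION BY RANGE (ts);
    CREATE TABLE log_2017 PARTITION OF log
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
        PARTITION BY RANGE (ts);
    CREATE TABLE log_2017_05 PARTITION OF log_2017
        FOR VALUES FROM ('2017-05-01') TO ('2017-06-01')
        PARTITION BY RANGE (ts);
    -- data gets loaded no lower than this level
    CREATE TABLE log_2017_05_16 PARTITION OF log_2017_05
        FOR VALUES FROM ('2017-05-16') TO ('2017-05-17')
        PARTITION BY HASH (user_id);
    -- the leaves below this have no application-level meaning
    CREATE TABLE log_2017_05_16_h0 PARTITION OF log_2017_05_16
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    -- ... and likewise for remainders 1 through 3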

> and disable a lot of logical partitioning capabilities. I'd be a
> little worried about what we do with attaching/detaching, though.

Attaching and detaching partitions only makes sense for partitions
whose partition keys have application-level meaning anyway.

Does it make sense at this point to separate our partitions into two
categories, those which can have significance to applications, and
those which can't?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Sun, May 14, 2017 at 9:35 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-14 21:22:58 -0400, Robert Haas wrote:
>> but wanting a CHECK constraint that applies to only one partition
>> seems pretty reasonable (e.g. CHECK that records for older years are
>> all in the 'inactive' state, or whatever).
>
> On a hash-partitioned table?

No, probably not.  But do we really want the rules for partitioned
tables to be different depending on the kind of partitioning in use?

> I'm not saying it can't work for any datatype, I just think it'd be a
> very bad idea to make it work for any non-trivial ones. The likelihood
> of regularly breaking or preventing us from improving things seems high.
> I'm not sure that a design where this works most of the time for some
> datatypes is a good one.

I think you might be engaging in undue pessimism here, but I suspect
we need to actually try doing the work before we know how it will turn
out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Mark Dilger
Date:
> On May 15, 2017, at 7:48 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> You'd have to prohibit a heck of a lot more than that in order for
>> this to work 100% reliably.  You'd have to prohibit CHECK constraints,
>> triggers, rules, RLS policies, and UNIQUE indexes, at the least.  You
>> might be able to convince me that some of those things are useless
>> when applied to partitions, but wanting a CHECK constraint that
>> applies to only one partition seems pretty reasonable (e.g. CHECK that
>> records for older years are all in the 'inactive' state, or whatever).
>> I think getting this to work 100% reliably in all cases would require
>> an impractically large hammer.
>
> The more I think about it the more I think hash partitions are
> "semi-logical". A check constraint on a single partition of a
> range-partitioned table makes sense, but it doesn't make sense on a
> single partition of a hash-partitioned table.

That depends on whether the user gets to specify the hash function, perhaps
indirectly by specifying a user defined opfamily.  I can imagine clever hash
functions that preserve certain properties of the incoming data, and check
constraints in development versions of the database that help verify the hash
is not violating those properties.

That's not to say such hash functions must be allowed in the hash partitioning
implementation; just that it does make sense if you squint and look a bit sideways
at it.

Mark Dilger


Re: [HACKERS] Hash Functions

From
David Fetter
Date:
On Mon, May 15, 2017 at 03:26:02PM -0400, Robert Haas wrote:
> On Sun, May 14, 2017 at 9:35 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-05-14 21:22:58 -0400, Robert Haas wrote:
> >> but wanting a CHECK constraint that applies to only one partition
> >> seems pretty reasonable (e.g. CHECK that records for older years
> >> are all in the 'inactive' state, or whatever).
> >
> > On a hash-partitioned table?
> 
> No, probably not.  But do we really want the rules for partitioned
> tables to be different depending on the kind of partitioning in use?

As the discussion has devolved here, it appears that there are, at
least conceptually, two fundamentally different classes of partition:
public, which is to say meaningful to DB clients, and "private", used
for optimizations, but otherwise opaque to DB clients.

Mashing those two cases together appears to cause more problems than
it solves.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote:
> As the discussion has devolved here, it appears that there are, at
> least conceptually, two fundamentally different classes of partition:
> public, which is to say meaningful to DB clients, and "private", used
> for optimizations, but otherwise opaque to DB clients.
>
> Mashing those two cases together appears to cause more problems than
> it solves.

I concur at this point. I originally thought hash functions might be
made portable, but I think Tom and Andres showed that to be too
problematic -- the issue with different encodings is the real killer.

But I also believe hash partitioning is important and we shouldn't
give up on it yet.

That means we need to have a concept of hash partitions that's
different from range/list partitioning. The terminology
"public"/"private" does not seem appropriate. Logical/physical or
external/internal might be better.

With hash partitioning:
* User only specifies number of partitions of the parent table; does
not specify individual partition properties (modulus, etc.)
* Dump/reload goes through the parent table (though we may provide
options so pg_dump/restore can optimize this)
* We could provide syntax to adjust the number of partitions, which
would be expensive but still useful sometimes.
* All DDL should be on the parent table, including check constraints,
FKs, unique constraints, exclusion constraints, indexes, etc.
  - Unique and exclusion constraints would only be permitted if the
keys are a superset of the partition keys.
  - FKs would only be permitted if the two table's partition schemes
match and the keys are members of the same hash opfamily (this could
be relaxed slightly, but it gets a little confusing if so)
* No attach/detach of partitions
* All partitions have the same permissions
* Individual partitions would only be individually-addressable for
maintenance (like reindex and vacuum), but not for arbitrary queries
  - perhaps also COPY for bulk loading/dumping, in case we get clients
smart enough to do their own hashing.
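To make that concrete, the user-visible surface might look something
like this (all of the syntax below -- the PARTITIONS clause, the names,
the maintenance form -- is hypothetical, just to illustrate the shape
of the proposal):

    CREATE TABLE account (
        id       bigint,
        region   text,
        balance  numeric,
        UNIQUE (id)     -- allowed: (id) is a superset of the partition key
    ) PARTITION BY HASH (id) PARTITIONS 16;

    CREATE INDEX ON account (region);   -- implicitly applies to all 16 pieces
    -- individual pieces addressable only for maintenance, e.g. something like:
    --   REINDEX TABLE account PARTITION 3;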

The only real downside is that it could surprise users -- why can I
add a CHECK constraint on my range-partitioned table but not the
hash-partitioned one? We should try to document this so users don't
find that out too far along. As long as they aren't surprised, I think
users will understand why these aren't quite the same concepts.

Regards,    Jeff Davis



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Tue, May 16, 2017 at 11:10 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> With hash partitioning:
> * User only specifies number of partitions of the parent table; does
> not specify individual partition properties (modulus, etc.)
> * Dump/reload goes through the parent table (though we may provide
> options so pg_dump/restore can optimize this)
> * We could provide syntax to adjust the number of partitions, which
> would be expensive but still useful sometimes.
> * All DDL should be on the parent table, including check constraints,
> FKs, unique constraints, exclusion constraints, indexes, etc.
>   - Unique and exclusion constraints would only be permitted if the
> keys are a superset of the partition keys.
>   - FKs would only be permitted if the two table's partition schemes
> match and the keys are members of the same hash opfamily (this could
> be relaxed slightly, but it gets a little confusing if so)
> * No attach/detach of partitions
> * All partitions have the same permissions
> * Individual partitions would only be individually-addressable for
> maintenance (like reindex and vacuum), but not for arbitrary queries
>   - perhaps also COPY for bulk loading/dumping, in case we get clients
> smart enough to do their own hashing.

I don't really find this a very practical design.  If the table
partitions are spread across different relfilenodes, then those
relfilenodes have to have separate pg_class entries and separate
indexes, and those indexes also need to have separate pg_class
entries.  Otherwise, nothing works.  And if they do have separate
pg_class entries, then the partitions have to have their own names,
and likewise for their indexes, and a dump-and-reload has to preserve
those names.  If it doesn't, and those objects get new system-assigned
names after the dump-and-reload, then dump restoration can fail when a
system-assigned name collides with an existing name that is first
mentioned later in the dump.

If we had the ability to have anonymous pg_class entries -- relations
that have no names -- then maybe it would be possible to make
something like what you're talking about work.  But that does not seem
easy to do.  There's a unique index on (relname, relnamespace) for
good reason, and we can't make it partial on a system catalog.  We
could make the relname column allow nulls, but that would add overhead
to any code that needs to access the relation name, and there's a fair
amount of that.

Similarly, if we had the ability to associate multiple relfilenodes
with a single relation, and if index entries could point to
<which-relfilenode, block, offset> rather than just <block, offset>,
then we could also make this work.  But either of those things would
require significant re-engineering and would have downsides in other
cases.

If Java has portable hash functions, why can't we?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
David Fetter
Date:
On Tue, May 16, 2017 at 08:10:39AM -0700, Jeff Davis wrote:
> On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote:
> > As the discussion has devolved here, it appears that there are, at
> > least conceptually, two fundamentally different classes of partition:
> > public, which is to say meaningful to DB clients, and "private", used
> > for optimizations, but otherwise opaque to DB clients.
> >
> > Mashing those two cases together appears to cause more problems than
> > it solves.
> 
> I concur at this point. I originally thought hash functions might be
> made portable, but I think Tom and Andres showed that to be too
> problematic -- the issue with different encodings is the real killer.
> 
> But I also believe hash partitioning is important and we shouldn't
> give up on it yet.
> 
> That means we need to have a concept of hash partitions that's
> different from range/list partitioning. The terminology
> "public"/"private" does not seem appropriate. Logical/physical or
> external/internal might be better.

I'm not attached to any particular terminology.

> With hash partitioning:
> * User only specifies number of partitions of the parent table; does
> not specify individual partition properties (modulus, etc.)

Maybe this is over-thinking it, but I'm picturing us ending up with
something along the lines of:
   PARTITION BY INTERNAL(EXPRESSION)   [ WITH ( [PARAMETERS] [, AS, APPROPRIATE] ) ]

i.e. it's not clear that we should wire in "number of partitions" as a
parameter.

In a not that distant future, ANALYZE and similar could have a say in
determining both the "how" and the "whether" of partitioning.

> * Dump/reload goes through the parent table (though we may provide
> options so pg_dump/restore can optimize this)

Would it be simplest to default to routing through the immediate
ancestor for now?

It occurs to me that with the opaque partition system we're designing
here, internal partitions would necessarily be leaves in the tree.

> * We could provide syntax to adjust the number of partitions, which
> would be expensive but still useful sometimes.

Yep.  I suspect that techniques for this are described in literature,
and possibly even in code bases.  Any pointers?

> * All DDL should be on the parent table, including check constraints,
> FKs, unique constraints, exclusion constraints, indexes, etc.

Necessarily.

>   - Unique and exclusion constraints would only be permitted if the
> keys are a superset of the partition keys.

"Includes either all of the partition expression or none of it,"
maybe?

>   - FKs would only be permitted if the two table's partition schemes
> match and the keys are members of the same hash opfamily (this could
> be relaxed slightly, but it gets a little confusing if so)

Relaxing sounds like a not-in-the-first-cut feature, and subtle.

> * No attach/detach of partitions

Since they're opaque, this is the only sane thing.

> * All partitions have the same permissions

Since they're opaque, this is the only sane thing.

> * Individual partitions would only be individually-addressable for
> maintenance (like reindex and vacuum), but not for arbitrary queries

Since they're opaque, this is the only sane thing.

>   - perhaps also COPY for bulk loading/dumping, in case we get clients
> smart enough to do their own hashing.

This is appealing from a resource allocation point of view in the
sense of deciding where the hash computing resources are spent.  Do we
want something like the NOT VALID/VALIDATE infrastructure to support
it?

> The only real downside is that it could surprise users -- why can I
> add a CHECK constraint on my range-partitioned table but not the
> hash-partitioned one? We should try to document this so users don't
> find that out too far along. As long as they aren't surprised, I think
> users will understand why these aren't quite the same concepts.

+1

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Tuesday, May 16, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really find this a very practical design.  If the table
> partitions are spread across different relfilenodes, then those
> relfilenodes have to have separate pg_class entries and separate
> indexes, and those indexes also need to have separate pg_class
> entries.  Otherwise, nothing works.  And if they do have separate
> pg_class entries, then the partitions have to have their own names,
> and likewise for their indexes, and a dump-and-reload has to preserve
> those names.  If it doesn't, and those objects get new system-assigned
> names after the dump-and-reload, then dump restoration can fail when a
> system-assigned name collides with an existing name that is first
> mentioned later in the dump.

Why can't hash partitions be stored in tables the same way as we do TOAST? That should take care of the naming problem.
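(For comparison, a TOAST table's name is system-assigned from the owning
table's OID and is looked up through the catalogs rather than remembered
by the user, e.g.:

    -- 'mytable' is a placeholder; returns something like pg_toast.pg_toast_16402
    SELECT reltoastrelid::regclass
      FROM pg_class
     WHERE oid = 'mytable'::regclass;

Something along those lines could presumably work for internal hash
partitions too.)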

> If Java has portable hash functions, why can't we?

Java standardizes on a particular Unicode encoding (UTF-16). Are you suggesting that we do the same? Or is there another solution that I am missing?

Regards,
   Jeff Davis

Re: [HACKERS] Hash Functions

From
Peter Eisentraut
Date:
On 5/16/17 11:10, Jeff Davis wrote:
> I concur at this point. I originally thought hash functions might be
> made portable, but I think Tom and Andres showed that to be too
> problematic -- the issue with different encodings is the real killer.

I think it would be OK that if you want to move a hash-partitioned table
to a database with a different encoding, you have to do dump/restore
through the parent table.  This is quite similar to what you have to do
now if you want to move a range-partitioned table to a database with a
different locale.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Hash Functions

From
Amit Langote
Date:
On 2017/05/17 5:25, Jeff Davis wrote:
> On Tuesday, May 16, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't really find this a very practical design.  If the table
>> partitions are spread across different relfilenodes, then those
>> relfilenodes have to have separate pg_class entries and separate
>> indexes, and those indexes also need to have separate pg_class
>> entries.  Otherwise, nothing works.  And if they do have separate
>> pg_class entries, then the partitions have to have their own names,
>> and likewise for their indexes, and a dump-and-reload has to preserve
>> those names.  If it doesn't, and those objects get new system-assigned
>> names after the dump-and-reload, then dump restoration can fail when a
>> system-assigned name collides with an existing name that is first
>> mentioned later in the dump.
> 
> Why can't hash partitions be stored in tables the same way as we do TOAST?
> That should take care of the naming problem.

There is only one TOAST table per relation though, containing all of the
relation's TOASTed data.  Doesn't the multiplicity of hash partitions pose
a problem?  Will a hash partition of a given name end up with the same
subset of data in the target database as it did in the source database?  I
suppose it won't matter though if we make hash partitions an
implementation detail of hash partitioned tables to the extent you
described in your earlier email [1], whereby it doesn't matter to an
application which partition contains what subset of the table's data.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAMp0ubcQ3VYdU1kNUCOmpj225U4hk6ZEoaUVeReP8h60p%2Bmv1Q%40mail.gmail.com




Re: [HACKERS] Hash Functions

From
Ashutosh Bapat
Date:
On Tue, May 16, 2017 at 8:40 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, May 15, 2017 at 1:04 PM, David Fetter <david@fetter.org> wrote:
>> As the discussion has devolved here, it appears that there are, at
>> least conceptually, two fundamentally different classes of partition:
>> public, which is to say meaningful to DB clients, and "private", used
>> for optimizations, but otherwise opaque to DB clients.
>>
>> Mashing those two cases together appears to cause more problems than
>> it solves.
>
> I concur at this point. I originally thought hash functions might be
> made portable, but I think Tom and Andres showed that to be too
> problematic -- the issue with different encodings is the real killer.
>
> But I also believe hash partitioning is important and we shouldn't
> give up on it yet.
>
> That means we need to have a concept of hash partitions that's
> different from range/list partitioning. The terminology
> "public"/"private" does not seem appropriate. Logical/physical or
> external/internal might be better.
>
> With hash partitioning:
> * User only specifies number of partitions of the parent table; does
> not specify individual partition properties (modulus, etc.)

A well-distributed integer column doesn't even need to be hashed; a
simple modulo works for it. If we are going towards "implicit" (yet
another name) partitioning, we could choose the strategy based on the
data type of the partition key, not just always hash it. Although we
might end up hashing it in most of the cases.

> * Dump/reload goes through the parent table (though we may provide
> options so pg_dump/restore can optimize this)

Probably you mean the immediate hash-partitioned parent, but just let
me clarify a bit. We support multi-level partitioning, with each
partitioned table anywhere in the partitioning hierarchy choosing any
partitioning scheme. So we can have a range-partitioned table as a
partition of a hash-partitioned table, or for that matter a non-hash
partitioned table somewhere in the hierarchy rooted at the hash
partitioned table. So for range/list or even hash partitions that are
grandchildren of a hash-partitioned table, we will need to route
dump/reload through that hash-partitioned table, i.e. route it through
the topmost hash-partitioned table.

> * We could provide syntax to adjust the number of partitions, which
> would be expensive but still useful sometimes.
> * All DDL should be on the parent table, including check constraints,
> FKs, unique constraints, exclusion constraints, indexes, etc.

i.e. topmost hash partitioned table as explained above.

>   - Unique and exclusion constraints would only be permitted if the
> keys are a superset of the partition keys.

Do you think this restriction applies even after we support global
indexes? Isn't it applicable to all the partitioning strategies?

>   - FKs would only be permitted if the two table's partition schemes
> match and the keys are members of the same hash opfamily (this could
> be relaxed slightly, but it gets a little confusing if so)
> * No attach/detach of partitions

It would be good if we could support this for maintenance purposes. If
a partition goes bad, we could replace it with a copy kept somewhere,
using attach and detach, without affecting the whole table. Does that
mean we will need to support some form of pg_dump/copy with a special
flag to create copies of individual partitions? I think that proposal
has already been floated.

> * All partitions have the same permissions

Why's that?

>
> The only real downside is that it could surprise users -- why can I
> add a CHECK constraint on my range-partitioned table but not the
> hash-partitioned one? We should try to document this so users don't
> find that out too far along. As long as they aren't surprised, I think
> users will understand why these aren't quite the same concepts.
>

For a transparent hash (as opposed to the non-transparent kind Mark
Dilger proposed), any constraint other than the implicit partitioning
constraint either applies to the whole table or doesn't apply at all.
So it's better if the user adds it on the parent hash table. With that
reasoning, yes, we could document this fact.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> Why can't hash partitions be stored in tables the same way as we do TOAST?
> That should take care of the naming problem.

Hmm, yeah, something like that could be done, but every place where
you are currently allowed to refer to a partition by name would have
to be changed to accept some other syntax for specifying the
partition.  Even with all the things you propose to disallow, things
like CLUSTER, VACUUM, ANALYZE, etc. would still have to be accepted on
individual partitions.  I suspect even CREATE INDEX and DROP INDEX
would need to be accepted on individual partitions, because if an index
on one partition somehow becomes bloated while the corresponding
indexes on other partitions are OK, you'll want to create a
replacement index concurrently and drop the old one.  Of course, for
similar reasons, you'd need some way for \d on the parent to display
information on indexes on all the children, and all of that output
would have to be frobbed to use whatever syntax is now required, in lieu
of a name, to use an individual partition.  Error messages would have
to be adjusted in probably quite a few places to use the new notation,
too.  And on and on.  It's not impossible to do, but we could end up
chasing down loose ends for a very long time.

Beyond that, I think it's a bad idea to make hash partitions behave
completely differently from list and range partitions.  That's a lot
of extra code to maintain, and a lot of extra notional complexity
for users, for really not a lot of benefit.  I think we're taking a
significant but not overwhelming problem -- our current hash functions
aren't portable -- and through a chain of logical links eventually
ending up with the conclusion that the design of partitioning needs to
be totally overhauled.  I want to resist that conclusion.  I'm not
saying that the problem isn't a problem, or that there's not some
logic to each step in the chain, but it's not that hard to blow a
small problem up into a huge one by assuming the worst possible
consequences or the tightest possible requirements at each step.
http://tvtropes.org/pmwiki/pmwiki.php/Main/ForWantOfANail is not an
argument for stricter regulation of the blacksmith industry.

>> If Java has portable hash functions, why can't we?
>
> Java standardizes on a particular unicode encoding (utf-16). Are you
> suggesting that we do the same? Or is there another solution that I am
> missing?

Well, I've already said several times (and Peter Eisentraut has
agreed) that we don't really need the hash functions to be portable
across encodings.  I think there are at least three good arguments for
that position.

First, as Peter Geoghegan points out, the world is increasingly
standardizing on Unicode, and that trend seems likely to continue.  I
strongly suspect that UTF-8 is the most common database encoding by a
wide margin.  There may occasionally be reasons to avoid it if, for
example, you're using one of the Eastern languages that doesn't play
entirely nicely with UTF-8, or if you happen to be storing a large
number of characters that can be represented in a single byte in some
other encoding but which require multiple bytes in UTF-8, but for an
awful lot of people UTF-8 just works and there's no need to think any
further.  So, a lot of people will never hit the problem of needing to
migrate a database between encodings because they'll just use UTF-8.

Second, if the previous argument turns out to be wrong and the world
abandons UTF-8 in favor of some new and better system (UTF-9?), or if
users frequently want to make some other encoding conversion like
Tom's original example of LATIN1 -> UTF-8, we've already got a
proposed workaround for that case which seems like it will work just
fine.  Just dump your data with pg_dump
--insert-hash-partitioned-data-into-parent and reload on the new
system.  This isn't absolutely guaranteed to work if you've done
something silly that will make the load of a particular row work on
one partition and fail on some other one, but you probably won't do
that because it would be dumb.  Also, it will be a bit slower than a
regular dump-and-reload cycle because tuple routing isn't free.
Neither of these problems really sound very bad.  If we're going to
start fixing things that could cause database migrations/upgrades to
occasionally fail in corner cases or run more slowly than expected,
there's a long list of things that you can do in your DDL that will
make pg_upgrade bomb out, and many of them are things that bite users
with some regularity (e.g. tablespaces inside the data directory or
other tablespaces, dependencies on system objects that are changed in
the new version, ALTER USER .. SET ROLE).  For whatever reason, we
haven't viewed those warts as really high-priority items in need of
fixing; in some cases, instead of actually trying to improve
usability, we've all but mocked the people reporting those issues for
having the temerity to configure the system in a way that we think
isn't very smart.  How can we then turn around and say "it's
absolutely unacceptable for there to be any case where dump and reload
fails even if there's a workaround switch for pg_dump that will almost
always solve the problem"?  That's holding hash partitioning to a
stricter standard than our existing features, which seems unjustified.

Third, the fact that we support multiple encodings ought to be a
strong point of PostgreSQL, not an excuse for running away from
features.  This same issue came up with JSON support, and the question
was, well, what do we do about the places where JSON assumes that
everything is UTF-8, like in \uXXXX escape sequences, given that we
support multiple encodings?  We could have answered that question by
giving up and saying that JSON support is just too hard in our
multi-encoding environment, but we found some solution that let us
move forward.  I think the result is that JSON is not really quite
per-spec unless you are using UTF-8 as your encoding, but I am
confident that it's better to have made that trade-off than to just
not support JSON at all.  Similarly, I think that if we use the fact
that we support multiple encodings either as an excuse for why we
shouldn't have hash partitioning at all or for why the design needs to
be vastly more complex than would otherwise be required, we've made a
liability of what should be an asset.  Tom's willing to make those
arguments, but he doesn't think hash partitioning is worthwhile in the
first place.  If we believe it is worthwhile (and I do), then we
should be looking for a set of design decisions that make implementing
it reasonably practical, maybe with some trade-offs, rather than a set
of decisions that increase the complexity to the point where it will
take us 5-10 years to unravel all the problems.

Also, it's worth noting that our decision here doesn't need to be
monolithic.  If we want, we can ship one opfamily that hashes strings
in the obvious way - i.e. fast but not portable across encodings - and
another that converts the input string to UTF-8 and then hashes the
result.  The latter will be somewhat slow, but it will be portable
across encodings, and I don't have any problem at all with providing
it if somebody wants to do the legwork.  It'll also fail in any
encodings that can't be converted to UTF-8, but it seems a bit much to
insist that hash partitioning must solve a problem that the Unicode
consortium hasn't managed to overcome.  Anyway, if we do something
like that, then users can choose whether they want the portability for
which Tom advocates or the speed which I believe most users will
prefer.
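As a crude SQL-level illustration of that second opfamily's idea --
normalize to UTF-8 bytes before hashing, so the same logical string
hashes identically regardless of the database encoding -- something
like the following (the function name is made up, a real opclass
support function would live in C, and this still assumes the same
underlying hash_any behavior on both machines):

    CREATE FUNCTION hashtext_utf8(s text) RETURNS integer
        LANGUAGE sql STABLE STRICT
        AS $$ SELECT hashtext(encode(convert_to(s, 'UTF8'), 'hex')) $$;

    -- the hex output is plain ASCII, so the same bytes get hashed whether
    -- the database encoding is LATIN1 or UTF8
    SELECT hashtext_utf8('naïve');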

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> Why can't hash partitions be stored in tables the same way as we do TOAST?
>> That should take care of the naming problem.

> Hmm, yeah, something like that could be done, but every place where
> you are currently allowed to refer to a partition by name would have
> to be changed to accept some other syntax for specifying the
> partition.

Uh ... toast tables have regular names, and can be specified in commands
just like any other table.  I don't see why these "auto" partition tables
couldn't be handled the same way.

> Beyond that, I think it's a bad idea to make hash partitions behave
> completely differently from list and range partitions.

I think the question is whether we are going to make a distinction between
logical partitions (where the data division rule makes some sense to the
user) and physical partitions (where it needn't).  I think it might be
perfectly reasonable for those to behave differently.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Wed, May 17, 2017 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, May 16, 2017 at 4:25 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> Why can't hash partitions be stored in tables the same way as we do TOAST?
>>> That should take care of the naming problem.
>
>> Hmm, yeah, something like that could be done, but every place where
>> you are currently allowed to refer to a partition by name would have
>> to be changed to accept some other syntax for specifying the
>> partition.
>
> Uh ... toast tables have regular names, and can be specified in commands
> just like any other table.  I don't see why these "auto" partition tables
> couldn't be handled the same way.

Really?  That seems like a huge usability fail to me.  If somebody
wants to create an index on one partition of a hash-partitioned table,
or reindex an index, do you really want them to have to dig out an
internal name to do it?  And how exactly would you dump and restore
the partitions and their indexes?  It's true that there are some
operations that can be performed directly on a TOAST table, but the
saving grace is that you usually don't need to do any of them.  That
won't be true here.

>> Beyond that, I think it's a bad idea to make hash partitions behave
>> completely differently from list and range partitions.
>
> I think the question is whether we are going to make a distinction between
> logical partitions (where the data division rule makes some sense to the
> user) and physical partitions (where it needn't).  I think it might be
> perfectly reasonable for those to behave differently.

I don't think I'd like to go so far as to say that it's unreasonable,
but I certainly wouldn't say I'm optimistic about such a design.  I do
not think that it is going to work to conceal from the user that the
partitions are really separate tables with their own indexes.  I also
think that trying to make such a thing work is just going to lead to a
lot of time and energy spent trying to paper over problems that are
basically self-inflicted, and that papering over those problems won't
really end up having any value for users.

Remember, the chain of reasoning here is something like:

1. To handle dump-and-reload the way partitioning does today, hash
functions would need to be portable across encodings.
2. That's impractically difficult.
3. So let's always load data through the top-parent.
4. But that could fail due to e.g. a UNIQUE index on an individual
child, so let's try to prohibit all of the things that could be done
to an individual partition that could cause a reload failure.
5. And then for good measure let's hide the existence of the partitions, too.

Every step in that chain of logic has a certain sense to it, but none
of them are exactly water-tight.  #1 is basically a value judgement:
would people rather (a) have faster hash functions, or (b) would they
rather be able to port a database to a different encoding without
having rows move between partitions?  The statement is only true
if you think it's the latter, but I tend to think it's the former.  #2
is a judgement that the performance characteristics of
as-yet-unwritten portable hashing will be so bad that nobody could
possibly be satisfied with it.  #3 is a great idea as an optional
behavior, but it's only a strict necessity if you're totally committed
to #1 and #2.  It also has some performance cost, which makes it
somewhat undesirable as a default behavior.  #4 is *probably* a
necessary consequence of #3.  I don't know what the argument for #5 is
unless it's that #4 isn't hard enough already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Wed, May 17, 2017 at 12:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. To handle dump-and-reload the way partitioning does today, hash
> functions would need to be portable across encodings.
> 2. That's impractically difficult.
> 3. So let's always load data through the top-parent.
> 4. But that could fail due to e.g. a UNIQUE index on an individual
> child, so let's try to prohibit all of the things that could be done
> to an individual partition that could cause a reload failure.
> 5. And then for good measure let's hide the existence of the partitions, too.

That is one thread of logic, but out of the discussion also
highlighted some of the consequences of treating hash partitions like
range/list partitions.

For instance, it makes little sense to have individual check
constraints, indexes, permissions, etc. on a hash-partitioned table.
It doesn't mean that we should necessarily forbid them, but it should
make us question whether combining range and hash partitions is really
the right design.

Regards,    Jeff Davis



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Wed, May 17, 2017 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think the question is whether we are going to make a distinction between
> logical partitions (where the data division rule makes some sense to the
> user) and physical partitions (where it needn't).  I think it might be
> perfectly reasonable for those to behave differently.

Agreed. To summarize my perspective:

* hash partitioning offers a nice way to divide the data for later
processing by parallel query
* range partitioning is good for partition elimination
(constraint_exclusion) and separating hot/cold data (e.g. partitioning
on date)
* both offer some maintenance benefits (e.g. reindex one partition at
a time), though range partitioning seems like it offers better
flexibility here in some cases

I lean toward separating the concepts, but Robert is making some
reasonable arguments and I could be convinced.

Regards,   Jeff Davis



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, May 18, 2017 at 1:53 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> For instance, it makes little sense to have individual check
> constraints, indexes, permissions, etc. on a hash-partitioned table.
> It doesn't mean that we should necessarily forbid them, but it should
> make us question whether combining range and hash partitions is really
> the right design.

I think that it definitely makes sense to have individual indexes on a
hash-partitioned table.  If you didn't, then as things stand today,
you'd have no indexes at all, which can't be good.  In the future, we
might have some system where an index created on the parent cascades
down to all of the children, but even then, you might want to REINDEX
just one of those child indexes, or better yet, create a replacement
index concurrently and then drop the old one concurrently.  You might
also want to add the same sort of new index to every partition, but
not in a single operation - for reasons of load, length of maintenance
window, time for which a snapshot is held open, etc.
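In other words, the usual bloat cure, applied to just one partition
(index and table names here are only placeholders):

    CREATE INDEX CONCURRENTLY orders_p7_customer_idx_new
        ON orders_p7 (customer_id);
    DROP INDEX CONCURRENTLY orders_p7_customer_idx;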

I agree that separate constraints and permissions on hash partitions
don't make much sense.  To a lesser extent, that's true of other kinds
of partitioning as well.  I mean, there is probably some use case for
setting separate permissions on a range-partitioned table, but it's a
pretty thin use case.  It certainly seems possible that many users
would prefer a rule that enforces uniform permissions across the
entire partitioning hierarchy.  This is one of the key things that had
to be decided in regard to the partitioning implementation we now
have: for which things should we enforce uniformity, and for which
things should we allow diversity?  I advocated for enforcing
uniformity only in areas where we could see a clear advantage to it,
which led to the fairly minimal approach of enforcing only that we had
no multiple inheritance and no extra columns in the children, but
that's certainly an arguable position.  Other people argued for more
restrictions, I believe out of a desire to create more administrative
simplicity, but there is a risk of cutting yourself off from useful
configurations there, and it seems very difficult to me to draw a hard
line between what is useful and what is useless.

For example, consider a hash-partitioned table.  Could it make sense
to have some but not all partitions be unlogged?  I think it could.
Suppose you have a cluster of machines each of which has a replica of
the same hash-partitioned table.  Each server uses logged tables for
the partitions for which it is the authoritative source of
information, and unlogged tables for the others.  In the event of
crash, the data for any tables that are lost are replicated from the
master for that machine.  I can think of some disadvantages of that
design, but I can think of some advantages, too, and I think it's
pretty hard to say that nobody should ever want to do it.  And if it's
legitimate to want to do that, then what if I want to use
trigger-based replication rather than logical replication?  Then I
might need triggers on some partitions but not all, or maybe different
triggers on different partitions.

Even for a permissions grant, suppose my production system is having
some problem that can't be replicated on the test data set.  Is it
reasonable to want to give a trusted developer access to a slice, but
not all of, my production data?  I could allow them access to just one
partition.  Maybe not a common desire, but is that enough reason to
ban it?  I'd say it's arguable.  I don't think that there are bright
lines around any of this stuff.  My experience with this area has led
me to give up on the idea of complete uniformity as impractical, and
instead look at it from the perspective of "what do we absolutely have
to ban in order for this to be sane?".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Hash Functions

From
Jeff Davis
Date:
On Thursday, May 18, 2017, Robert Haas <robertmhaas@gmail.com> wrote:
> My experience with this area has led
> me to give up on the idea of complete uniformity as impractical, and
> instead look at it from the perspective of "what do we absolutely have
> to ban in order for this to be sane?".

I could agree to something like that. Let's explore some of the challenges there and potential solutions:

1. Dump/reload of hash partitioned data.

Falling back to restore-through-the-root seems like a reasonable answer here. Moving to a different encoding is not an edge case, but it's not common either, so a performance penalty seems acceptable. I'm not immediately sure how we'd implement this in pg_dump/restore, so I'd feel a little more comfortable if I saw a sketch.
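Presumably the visible difference in the emitted dump would be roughly
this (table, column, and switch names are just placeholders):

    -- today: each leaf partition's data is dumped into that partition directly
    COPY public.orders_p7 (id, customer_id, total) FROM stdin;
    ...

    -- with a hypothetical --dump-partition-data-with-parent switch, the same
    -- rows are emitted against the root, so reload re-routes (re-hashes) them
    COPY public.orders (id, customer_id, total) FROM stdin;
    ...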

2. Having a lot of hash partitions would be cumbersome

The user would need to create and manage each partition, and try to do global operations in a sane way. The normal case would probably involve scripts to do things like add an index to all partitions, or a column. Many partitions would also just pollute the namespace unless you remember to put them in a separate schema (yes, it's easy, but most people will still forget). Some syntax sugar would go a long way here.
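For instance, "add an index to all partitions" today means a script
along these lines ('orders' and the column name are placeholders):

    DO $$
    DECLARE
        part regclass;
    BEGIN
        FOR part IN
            SELECT inhrelid::regclass
              FROM pg_inherits
             WHERE inhparent = 'orders'::regclass
        LOOP
            EXECUTE format('CREATE INDEX ON %s (customer_id)', part);
        END LOOP;
    END $$;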

3. The user would need to specify details they really don't care about for each partition.

Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are tedious boilerplate. And if the user makes a mistake, then 1/16 of inserts start failing. Probably would be caught during testing, but not exactly a good user experience. I'm not thrilled about this, considering that all the user really wants is 16 partitions, but it's not the end of the world.
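With the syntax in the proposed patch that boilerplate looks roughly
like this (names made up):

    CREATE TABLE events (id bigint, payload text) PARTITION BY HASH (id);
    CREATE TABLE events_p0 PARTITION OF events
        FOR VALUES WITH (MODULUS 16, REMAINDER 0);
    CREATE TABLE events_p1 PARTITION OF events
        FOR VALUES WITH (MODULUS 16, REMAINDER 1);
    -- ... fourteen more of these; leave one out and any insert that hashes
    -- to the missing remainder fails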

4. Detach is a foot-gun

If you detach a partition, random inserts will start failing. Not thrilled about this, but a hapless user would accept most of the blame if they stumble over it. Another way of saying this is with hash partitioning you really need the whole set for the table to be online at all. But we can't really enforce that, because it would limit some of the flexibility that you have in mind.
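Continuing the sketch above, the foot-gun is as simple as (assuming
detach is allowed on hash partitions at all, which is the scenario in
question):

    ALTER TABLE events DETACH PARTITION events_p3;
    -- from here on, any insert whose key hashes to remainder 3 fails for
    -- lack of a matching partition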


Stepping back, your approach might be closer to the general postgres philosophy of allowing the user to assemble from spare parts first, then a few releases later we offer some pre-built subassemblies, and a few releases later we make the typical cases work out of the box. I'm fine with it as long as we don't paint ourselves into a corner.

Of course we still have work to do on the hash functions. We should solve at least the most glaring portability problems, and try to harmonize the hash opfamilies. If you agree, I can put together a patch or two.

Regards,
    Jeff Davis

Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 19, 2017 at 2:36 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> I could agree to something like that. Let's explore some of the challenges
> there and potential solutions:
>
> 1. Dump/reload of hash partitioned data.
>
> Falling back to restore-through-the-root seems like a reasonable answer
> here. Moving to a different encoding is not an edge case, but it's not
> common either, so a performance penalty seems acceptable. I'm not
> immediately sure how we'd implement this in pg_dump/restore, so I'd feel a
> little more comfortable if I saw a sketch.

Right, I think this needs some investigation.  I can't whip up a
sketch on short notice, but I'll see if someone else at EnterpriseDB
can work on it unless somebody else wants to take a crack at it.

> 2. Having a lot of hash partitions would be cumbersome
>
> The user would need to create and manage each partition, and try to do
> global operations in a sane way. The normal case would probably involve
> scripts to do things like add an index to all partitions, or a column. Many
> partitions would also just pollute the namespace unless you remember to put
> them in a separate schema (yes, it's easy, but most people will still
> forget). Some syntax sugar would go a long way here.

I agree.  Adding a column already cascades to all children, and
there's a proposal to make the same thing true for indexes.  See
discussion beginning at
http://postgr.es/m/c8fe4f6b-ff46-aae0-89e3-e936a35f0cfd@postgrespro.ru

I haven't had time to review the code posted there yet, but I would
like to see something along the lines discussed there committed to
v11, and hopefully also something around foreign keys.  It should be
possible to create an outbound foreign key on a partitioned table and have
that cascade to all children.  Conversely,  it should also be possible
to create a foreign key referencing a partitioned table provided that
the foreign key references the partitioning key, and that there's a
unique index on those same columns on every partition.  (Referencing a
foreign key that is not the partitioning key will have to wait for
global indexes, I think.)

These things are useful not just for hash partitioning, but also for
list and range partitioning, and we'll save a lot of work if we can
use the same infrastructure for both cases.

> 3. The user would need to specify details they really don't care about for
> each partition.
>
> Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are tedious
> boilerplate. And if the user makes a mistake, then 1/16 of inserts start
> failing. Probably would be caught during testing, but not exactly a good
> user experience. I'm not thrilled about this, considering that all the user
> really wants is 16 partitions, but it's not the end of the world.

As I said on the hash partitioning thread, I think the way to handle
this is to get the basic feature in first, then add some convenience
syntax to auto-create the partitions.  That shouldn't be very
difficult to implement; I just didn't want to complicate things more
than necessary for the first version.  The same issue arose when
discussing list and range partitioning: Oracle has syntax like ours,
but with the ability (requirement?) to create the partitions in the
initial CREATE TABLE statement.  However, I wanted to make sure we had
the "inconvenient" syntax fully working and fully tested before we
added that, because I believe pg_dump needs to dump everything out the
long way.

> 4. Detach is a foot-gun
>
> If you detach a partition, random inserts will start failing. Not thrilled
> about this, but a hapless user would accept most of the blame if they
> stumble over it. Another way of saying this is with hash partitioning you
> really need the whole set for the table to be online at all. But we can't
> really enforce that, because it would limit some of the flexibility that you
> have in mind.

Yes, I agree with all of that.  I don't think it's really going to be
a problem in practice.  The only reason to detach a hash partition is
if you want to split it, and we may eventually have convenience syntax
to do that in an automated (i.e. less error-prone) way.  If somebody
does it manually and without a plan for putting back a replacement
partition, they may be sad, but if somebody puts a CHECK (false)
constraint on their table, they may be sad about that, too.  It's more
important to allow for flexibility than to prohibit every stupid thing
somebody might try to do.  Also, documentation helps.  We've got a
chapter on partitioning and it can be expanded to discuss these kinds
of issues.

> Stepping back, your approach might be closer to the general postgres
> philosophy of allowing the user to assemble from spare parts first, then a
> few releases later we offer some pre-built subassemblies, and a few releases
> later we make the typical cases work out of the box. I'm fine with it as
> long as we don't paint ourselves into a corner.

That's basically my thinking here.  Right now, our partitioning is
primitive in numerous ways, and so the rough edges are pretty visible.
However, I believe that with careful design we can file down many of
those rough edges over time.  Now, it's probably never going to be
quite as smooth as if the system had been designed for partitions from
the ground up, or at least not any time in the foreseeable future, but
I think it can still be very good.  If we add the above-referenced
logic for index and foreign key handling, convenience syntax for
creating partitions along with tables and for data-movement
operations, better partition pruning in the planner, run-time
partition-pruning in the executor, partitionwise join and aggregate,
MergeAppend -> Append strength reduction when the required sort order
matches the partition order, UPDATE tuple routing, default partitions,
and so on, I believe that partitioning will go from "eh, that's better
than inheritance" to "hey, that's actually really good".  There is a
ton of work to do there, but I think it is all doable within the
current infrastructure, and all of those things can (and in a number
of cases already are) being worked on as separate patches, so we can
get into each release what is ready and push off to the next release
what isn't.  As we go, we'll have fewer and fewer cases where
partitioning a table regresses performance and more and more cases
where the stuff you want to do just works.

Probably the toughest nut to crack is global indexes.  An argument was
made on this very mailing list a number of years ago that nobody
should want global indexes because $REASONS, but I was a little
skeptical of that argument at the time and it's been clear to me that
EnterpriseDB's customers, at least, do not in any way accept those
arguments.  It works on Oracle or other systems they use, and they
find it useful enough that they're unwilling to be without it.  If
PostgreSQL provides the same functionality, they'll use it for more
things than if it doesn't.  I find myself more than a bit intimidated
at the prospect of actually trying to make this work, but I've been
convinced that people won't stop asking until it does.

> Of course we still have work to do on the hash functions. We should solve at
> least the most glaring portability problems, and try to harmonize the hash
> opfamilies. If you agree, I can put together a patch or two.

I definitely agree with solving the portability problems to the extent
that we can reasonably do so.  I think adding more cross-type hash
opfamilies is a mildly good thing: I don't object to it, it probably
makes sense to do at the same time as any other hashing changes we
want to make, and it's better than not doing it.  At the same time, I
wouldn't walk over hot coals for it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, May 12, 2017 at 1:35 PM, Joe Conway <mail@joeconway.com> wrote:
>> That's a good point, but the flip side is that, if we don't have
>> such a rule, a pg_dump of a hash-partitioned table on one
>> architecture might fail to restore on another architecture.  Today, I
>> believe that, while the actual database cluster is
>> architecture-dependent, a pg_dump is architecture-independent.  Is it
>> OK to lose that property?
>
> Not from where I sit.

It was pointed out at PGCon that we've actually already crossed this
Rubicon.  If you create a range-partitioned table, put a bunch of data
into it, and then try to reload it on another system with a different
set of encoding definitions, the proper partition for some row might
be different.  That would result in the dump failing to reload with a
complaint about the partition key being violated.  And, in fact, you
could have the exact same issue on earlier releases which don't have
partitioning, because a CHECK constraint of the form (a >= 'something'
AND b < 'something else') could be true under one encoding and false
under another, and you could define such a constraint on any release
(from this millennium, anyway).

I'm not actually aware of an instance where this has bitten anyone,
even though it seems like it certainly could have and maybe should've
gotten somebody at some point.  Has anyone else?  I think it's a
reasonable guess that such problems will become more common with the
advent of partitioning and more common still as we continue to improve
partitioning, because people who otherwise would have given up on
PostgreSQL -- or at least on partitioning -- will actually try to use
it in earnest and then hit this problem.  However, my guess is that it
will still be pretty rare, and that having an optional
--dump-partition-data-with-parent flag that can be used when it crops
up will be an adequate workaround for most people.  Of course, that is
just my opinion.

So now I think that the right way to think about the issues around
hash partitioning is as a new variant of a problem that we already
have rather than an altogether new problem.  IMHO, the right questions
are:

1. Are the new problems worse than the old ones?

2. What could we do about it?

On #1, I'd say tentatively yes.  The problem of portability across
encodings is probably not new.  Suppose you have a table partitioned
by range, either using the new range partitioning or using the old
table inheritance method and CHECK constraints.  If you move that
table to a different encoding, will the collation behavior you get
under the new encoding match the collation behavior you got under the
old encoding?  The documentation says: "Also, a collation is tied to a
character set encoding (see Section 22.3). The same collation name may
exist for different encodings", which makes it sound like it is
possible but not guaranteed.  Even if the same collation name happens
to exist, there's no guarantee it behaves the same way under the new
encoding, and given our experiences with glibc so far, I'd bet against
it.  If glibc doesn't even think strcoll() and strxfrm() need to agree
with each other for the same collation, or that minor releases
shouldn't whack the behavior around, there doesn't seem to be room for
optimism about the possibility that they carefully preserve behavior
across similarly-named collations on different encodings.  On the
other hand, collation rules probably tend to vary only around the
edges, so there's a reasonably good chance that even if the collation
rules change when you switch encodings, every row will still get put
into the same partition as before.  If we implement hashing for hash
partitioning in some trivial way like hashing the raw bytes, that will
most certainly not be true -- *most* rows will move to a different
partition when you switch encodings.
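
To make that concrete, here is a tiny standalone sketch (illustrative only,
not from any patch in this thread).  The same character has different byte
representations in LATIN1 and UTF-8, so any function that hashes the raw
bytes -- the djb2-style loop below is just a stand-in -- routes the two
representations to different partitions:

#include <stdint.h>
#include <stdio.h>

/* Any byte-oriented hash works for the illustration; this is djb2. */
static uint32_t
hash_bytes(const unsigned char *buf, size_t len)
{
    uint32_t    h = 5381;

    for (size_t i = 0; i < len; i++)
        h = h * 33 + buf[i];
    return h;
}

int
main(void)
{
    const unsigned char latin1[] = {0xE9};          /* U+00E9 in LATIN1 */
    const unsigned char utf8[] = {0xC3, 0xA9};      /* U+00E9 in UTF-8 */

    /* The same logical value lands in different partitions out of 4. */
    printf("LATIN1 -> partition %u, UTF-8 -> partition %u\n",
           hash_bytes(latin1, sizeof latin1) % 4,
           hash_bytes(utf8, sizeof utf8) % 4);
    return 0;
}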

Furthermore, neither range nor list partitioning depends on properties
of the hardware, like how wide integers are, or whether they are
stored big-endian.  A naive approach to hash partitioning would depend
on those things.  That's clearly worse.

On #2, I'll attempt to list the approaches that have been proposed so far:

1. Don't implement hash partitioning (Tom Lane).  I don't think this
proposal will attract many votes.

2. Add an option like --dump-partition-data-with-parent.  I'm not sure
who originally proposed this, but it seems that everybody likes it.
What we disagree about is the degree to which it's sufficient.  Jeff
Davis thinks it doesn't go far enough: what if you have an old
plain-format dump that you don't want to hand-edit, and the source
database is no longer available?  Most people involved in the
unconference discussion of partitioning at PGCon seemed to feel that
wasn't really something we should worry about too much.  I had been
taking that position also, more or less because I don't see that there
are better alternatives.  For instance, Jeff proposed having the COPY
command specify both the parent and the child and providing a run-time
option of some kind to decide which table name gets used, but I think
that would be a fairly unpleasant syntax wart with some risk of
breaking other cases (e.g. what if you want to restore a single child
on a system where the parent doesn't exist?).  Someone else in the
room had a different take on why we shouldn't worry about this
problem, which I'll try to summarize: "Well, encoding conversions are
already so painful that if you're laboring under any illusion that
it's all just going to work, you're wrong, and this isn't going to
make anything materially worse."

3. Implement portable hash functions (Jeff Davis or me, not sure
which).  Andres scoffed at this idea, but I still think it might have
legs.  Coming up with a hashing algorithm for integers that produces
the same results on big-endian and little-endian systems seems pretty
feasible, even with the additional constraint that it should still be
fast.  Coming up with a hashing algorithm that produces the same
results across the various encodings of the same characters is
definitely feasible in any case where we know how to do encoding
conversion among all relevant encodings, but it is probably hard to
make fast.  Those two things also solve different parts of the
problem; one is insulating the user from a difference in hardware
architecture, while the other is insulating the user from a difference
in user-selected settings.  I think that the first of those things is
more important than the second, because it's easier to change your
settings than it is to change your hardware.  Also, I think that there
may be room for allowing for some user choice on this point; if people
are willing to write multiple hash opfamilies with different
portability-performance trade-offs, we can offer them all, either in
core or contrib, and users can pick the ones that have the properties
they want.  My personal guess is that most people will prefer the fast
hash functions over the ones that solve their potential future
migration problems, but, hey, options are good.

I think that some combination of #2 and #3 is probably as well as
we're going to do here, and personally I think we can make that pretty
good.  Users who use PostgreSQL 11 to do hash partitioning using a
composite key consisting of one float8 column and one SJIS-encoded
text column on a VAX and attach to each partition a CHECK constraint
that happens to pass only because of the vagaries of which rows end up
in which partitions, and who then try to migrate to  Linux64 with a
numeric column and a UTF-8 encoded text column with the same CHECK
constraints will have some problems to work out.  However, for the
vast majority of people, reloading the data through the top-parent is
going to be good enough, and the more portable we can make the hash
functions, the fewer people will need to do even that much.  Of
course, if there are other things we can do for a reasonable
investment of effort to make things better still, great!  I'm open to
suggestions.  But I don't think we need to panic about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-06-01 13:59:42 -0400, Robert Haas wrote:
> I'm not actually aware of an instance where this has bitten anyone,
> even though it seems like it certainly could have and maybe should've
> gotten somebody at some point.  Has anyone else?

Two comments: First, citus has been doing hash-partitioning and
append/range partitioning for a while now, and I'm not aware of anyone
being bitten by this (although there've been plenty other things ;)),
even though there've been cases upgrading to different collation &
encodings.  Secondly, I think that's to a significant degree caused by
the fact that in practice people way more often partition on types like
int4/int8/date/timestamp/uuid rather than text - there's rarely good
reasons to do the latter.

> Furthermore, neither range nor list partitioning depends on properties
> of the hardware, like how wide integers are, or whether they are
> stored big-endian.  A naive approach to hash partitioning would depend
> on those things.  That's clearly worse.

I don't think our current int4/8 hash functions depend on
FLOAT8PASSBYVAL.


> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which).  Andres scoffed at this idea, but I still think it might have
> legs.  Coming up with a hashing algorithm for integers that produces
> the same results on big-endian and little-endian systems seems pretty
> feasible, even with the additional constraint that it should still be
> fast.

Just to clarify: I don't think it's a problem to do so for integers and
most other simple scalar types. There are plenty of hash algorithms that are
endianness independent, and the rest is just a bit of care.  Where I see
a lot more issues is doing so for more complex types like arrays, jsonb,
postgis geometry/geography types and the like, where the fast and simple
implementation is to just hash the entire datum - and that'll very
commonly not be portable at all due to padding and type width
differences.


> My personal guess is that most people will prefer the fast
> hash functions over the ones that solve their potential future
> migration problems, but, hey, options are good.

I'm pretty sure that will be the case.  I'm not sure that adding
infrastructure to allow for something that nobody will use in practice
is a good idea.  If there ends up being demand for it, we can still go there.


I think that the number of people that migrate between architectures is
low enough that this isn't going to be a very common issue.  Having some
feasible way around this is important, but I don't think we should
optimize heavily for it by developing new infrastructure / complicating
experience for the 'normal' uses.


Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Joe Conway
Date:
On 06/01/2017 11:25 AM, Andres Freund wrote:
> On 2017-06-01 13:59:42 -0400, Robert Haas wrote:
>> My personal guess is that most people will prefer the fast
>> hash functions over the ones that solve their potential future
>> migration problems, but, hey, options are good.
>
> I'm pretty sure that will be the case.  I'm not sure that adding
> infrastructure to allow for something that nobody will use in practice
> is a good idea.  If there ends up being demand for it, we can still go there.
>
> I think that the number of people that migrate between architectures is
> low enough that this isn't going to be a very common issue.  Having some
> feasible way around this is important, but I don't think we should
> optimize heavily for it by developing new infrastructure / complicating
> experience for the 'normal' uses.

+1

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Thu, Jun 1, 2017 at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. Are the new problems worse than the old ones?
>
> 2. What could we do about it?

Exactly the right questions.

1. For range partitioning, I think it's "yes, a little". As you point
out, there are already some weird edge cases -- the main way range
partitioning would make the problem worse is simply by having more
users.

But for hash partitioning I think the problems will become more
substantial. Different encodings, endian issues, etc. will be a
headache for someone, and potentially a bad day if they are urgently
trying to restore on a new machine. We should remember that not
everyone is a full-time postgres DBA, and users might reasonably think
that the default options to pg_dump[all] will give them a portable
dump.

2. I basically see two approaches to solve the problem:
  (a) Tom suggested at PGCon that we could have a GUC that
automatically causes inserts to the partition to be re-routed through
the parent. We could discuss whether to always route through the
parent, or do a recheck on the partition constraints and only reroute
tuples that will fail it. If the user gets into trouble, the worst
that would happen is a helpful error message telling them to set the
GUC. I like this idea.
  (b) I had suggested before that we could make the default text dump
(and the default output from pg_restore, for consistency) route
through the parent. Advanced users would dump with -Fc, and pg_restore
would support an option to do partition-wise loading. To me, this is
simpler, but users might forget to use (or not know about) the
pg_restore option and then it would load more slowly. Also, the ship
is sailing on range partitioning, so we might prefer option (a) just
to avoid making any changes.

I am fine with either option.

> 2. Add an option like --dump-partition-data-with-parent.  I'm not sure
> who originally proposed this, but it seems that everybody likes it.
> What we disagree about is the degree to which it's sufficient.  Jeff
> Davis thinks it doesn't go far enough: what if you have an old
> plain-format dump that you don't want to hand-edit, and the source
> database is no longer available?  Most people involved in the
> unconference discussion of partitioning at PGCon seemed to feel that
> wasn't really something we should worry about too much.  I had been
> taking that position also, more or less because I don't see that there
> are better alternatives.

If the suggestions above are unacceptable, and we don't come up with
anything better, then of course we have to move on. I am worrying now
primarily because now is the best time to worry; I don't expect any
horrible outcome.

> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which).  Andres scoffed at this idea, but I still think it might have
> legs.

I think it reduces the problem, which has value, but it's hard to make
it rock-solid.

> make fast.  Those two things also solve different parts of the
> problem; one is insulating the user from a difference in hardware
> architecture, while the other is insulating the user from a difference
> in user-selected settings.  I think that the first of those things is
> more important than the second, because it's easier to change your
> settings than it is to change your hardware.

Good point.

Regards,    Jeff Davis



Re: [HACKERS] Hash Functions

From
Jeff Davis
Date:
On Thu, Jun 1, 2017 at 11:25 AM, Andres Freund <andres@anarazel.de> wrote:
> Secondly, I think that's to a significant degree caused by
> the fact that in practice people way more often partition on types like
> int4/int8/date/timestamp/uuid rather than text - there's rarely good
> reasons to do the latter.

Once we support more pushdowns to partitions, the only question is:
what are your join keys and what are your grouping keys?

Text is absolutely a normal join key or group key. Consider joins on a
user ID or grouping by a model number.

Regards,    Jeff Davis



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, Jun 2, 2017 at 1:24 AM, Jeff Davis <pgsql@j-davis.com> wrote:
> 1. For range partitioning, I think it's "yes, a little". As you point
> out, there are already some weird edge cases -- the main way range
> partitioning would make the problem worse is simply by having more
> users.

I agree.

> But for hash partitioning I think the problems will become more
> substantial. Different encodings, endian issues, etc. will be a
> headache for someone, and potentially a bad day if they are urgently
> trying to restore on a new machine. We should remember that not
> everyone is a full-time postgres DBA, and users might reasonably think
> that the default options to pg_dump[all] will give them a portable
> dump.

I agree to an extent.  I think the problem will be worse for hash
partitioning but I might disagree with you on how much worse.  I think
that most people don't do encoding conversions very often, and that
those who do know (or should know) enough to expect trouble.  I think
most people do endian-ness conversions almost never, but since that's
a matter of hardware not configuration I'd like to paper over that
case if we can.

> 2. I basically see two approaches to solve the problem:
>   (a) Tom suggested at PGCon that we could have a GUC that
> automatically causes inserts to the partition to be re-routed through
> the parent. We could discuss whether to always route through the
> parent, or do a recheck on the partition constrains and only reroute
> tuples that will fail it. If the user gets into trouble, the worst
> that would happen is a helpful error message telling them to set the
> GUC. I like this idea.

Yeah, that's not crazy.  I find it a bit surprising in terms of the
semantics, though.  SET
when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it =
true?

>   (b) I had suggested before that we could make the default text dump
> (and the default output from pg_restore, for consistency) route
> through the parent. Advanced users would dump with -Fc, and pg_restore
> would support an option to do partition-wise loading. To me, this is
> simpler, but users might forget to use (or not know about) the
> pg_restore option and then it would load more slowly. Also, the ship
> is sailing on range partitioning, so we might prefer option (a) just
> to avoid making any changes.

I think this is a non-starter.  The contents of the dump shouldn't
depend on the format chosen; that is bound to confuse somebody.  I
also do not wish to inflict a speed penalty on the users of
plain-format dumps.

>> 2. Add an option like --dump-partition-data-with-parent.  I'm not sure
>> who originally proposed this, but it seems that everybody likes it.
>> What we disagree about is the degree to which it's sufficient.  Jeff
>> Davis thinks it doesn't go far enough: what if you have an old
>> plain-format dump that you don't want to hand-edit, and the source
>> database is no longer available?  Most people involved in the
>> unconference discussion of partitioning at PGCon seemed to feel that
>> wasn't really something we should worry about too much.  I had been
>> taking that position also, more or less because I don't see that there
>> are better alternatives.
>
> If the suggestions above are unacceptable, and we don't come up with
> anything better, then of course we have to move on. I am worrying now
> primarily because now is the best time to worry; I don't expect any
> horrible outcome.

OK.

>> 3. Implement portable hash functions (Jeff Davis or me, not sure
>> which).  Andres scoffed at this idea, but I still think it might have
>> legs.
>
> I think it reduces the problem, which has value, but it's hard to make
> it rock-solid.

I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Joe Conway
Date:
On 06/02/2017 05:47 AM, Robert Haas wrote:
> On Fri, Jun 2, 2017 at 1:24 AM, Jeff Davis <pgsql@j-davis.com> wrote:
>> 2. I basically see two approaches to solve the problem:
>>   (a) Tom suggested at PGCon that we could have a GUC that
>> automatically causes inserts to the partition to be re-routed through
>> the parent. We could discuss whether to always route through the
>> parent, or do a recheck on the partition constrains and only reroute
>> tuples that will fail it. If the user gets into trouble, the worst
>> that would happen is a helpful error message telling them to set the
>> GUC. I like this idea.
>
> Yeah, that's not crazy.  I find it a bit surprising in terms of the
> semantics, though.  SET
> when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it =
> true?

Maybe
  SET partition_tuple_retry = true;
-or-
  SET partition_tuple_reroute = true;
?

I like the idea of only rerouting when failing constraints although I
can envision where there might be use cases where you essentially want
to re-partition and therefore reroute everything, leading to:
 SET partition_tuple_reroute = (none | error | all);

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, Jun 2, 2017 at 10:19 AM, Joe Conway <mail@joeconway.com> wrote:
>> Yeah, that's not crazy.  I find it a bit surprising in terms of the
>> semantics, though.  SET
>> when_i_try_to_insert_into_a_specific_partition_i_dont_really_mean_it =
>> true?
>
> Maybe
>   SET partition_tuple_retry = true;
> -or-
>   SET partition_tuple_reroute = true;
> ?
>
> I like the idea of only rerouting when failing constraints although I
> can envision where there might be use cases where you essentially want
> to re-partition and therefore reroute everything, leading to:
>
>   SET partition_tuple_reroute = (none | error | all);

Personally, I think it's more elegant to make this a pg_dump option
than to make it a server GUC, but I'm not going to spend time fighting
the server GUC idea if other people like it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres@anarazel.de> wrote:
> Just to clarify: I don't think it's a problem to do so for integers and
> most other simple scalar types. There's plenty hash algorithms that are
> endianess independent, and the rest is just a bit of care.

Do you have any feeling for which of those endianness-independent hash
functions might be a reasonable choice for us?

https://github.com/markokr/pghashlib implements various hash functions
for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
lookup3be and lookup3le, md5, and siphash24 are endian-independent.

An interesting point here is that Jeff Davis asserted in the original
post on this thread that our existing hash_any() wasn't portable, but
our current hash_any seems to be the Jenkins algorithm -- so I'm
confused.  Part of the problem seems to be that, according to
https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
those.  I don't know whether the one in pghashlib is the same one
we've implemented.

Kenneth Marshall suggested xxhash as an endian-independent algorithm
upthread.  Code for that is available under a 2-clause BSD license.

PostgreSQL page checksums use an algorithm based on, but not exactly,
FNV-1a.  See storage/checksum_impl.h.  The comments there say this
algorithm was chosen with speed in mind.  Our version is not
endian-independent because it folds in 4-byte integers rather than
1-byte integers, but plain old FNV-1a *is* endian-independent and
could be used.
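
For illustration, a byte-at-a-time FNV-1a is only a few lines (a sketch using
the standard published constants, not the code in storage/checksum_impl.h).
It performs no word-sized loads of its input, so the result depends only on
the byte sequence; for byte-string types like text that sequence is the same
on every machine, while integer types would still need to be serialized to a
fixed byte order before hashing:

#include <stdint.h>
#include <stddef.h>

static uint32_t
fnv1a_32(const unsigned char *buf, size_t len)
{
    uint32_t    h = 2166136261u;    /* FNV-1a 32-bit offset basis */

    for (size_t i = 0; i < len; i++)
    {
        h ^= buf[i];                /* fold in one byte at a time */
        h *= 16777619u;             /* FNV 32-bit prime */
    }
    return h;
}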

We also have an implementation of CRC32C in core - see port/pg_crc32.h
and port/pg_crc32c_sb8.c.  It's not clear to me whether this is
Endian-independent or not, although there is stuff that depends on
WORDS_BIGENDIAN, so, uh, maybe?

Some other possibly-interesting links:
https://research.neustar.biz/2011/12/29/choosing-a-good-hash-function-part-2/
http://greenrobot.org/essentials/features/performant-hash-functions-for-java/comparison-of-hash-functions/
https://www.strchr.com/hash_functions

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
Hi,

On 2017-08-03 17:09:41 -0400, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres@anarazel.de> wrote:
> > Just to clarify: I don't think it's a problem to do so for integers and
> > most other simple scalar types. There's plenty hash algorithms that are
> > endianess independent, and the rest is just a bit of care.
> 
> Do you have any feeling for which of those endianness-independent hash
> functions might be a reasonable choice for us?

Not a strong / very informed one, TBH.

I'm not convinced it's worth trying to achieve this in the first place,
now that we "nearly" have the insert-via-parent feature. Isn't this a
lot of work, for very little practical gain? Having to select that when
switching architectures still seems unproblematic. People will just
about never switch endianness, so even a tiny performance & effort
overhead doesn't seem worth it to me.


Leaving that aside:


> https://github.com/markokr/pghashlib implements various hash functions
> for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
> lookup3be and lookup3le, md5, and siphash24 are endian-independent.

> An interesting point here is that Jeff Davis asserted in the original
> post on this thread that our existing hash_any() wasn't portable, but
> our current hash_any seems to be the Jenkins algorithm -- so I'm
> confused.  Part of the problem seems to be that, according to
> https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
> those.  I don't know whether the one in pghashlib is the same one
> we've implemented.

IIUC lookup3be/le from Marko's hashlib just has an endianness conversion
added.  I'd personally not go for jenkins3, it's not particularly fast,
nor does it balance that out w/ being cryptographically secure.


> Kenneth Marshall suggested xxhash as an endian-independent algorithm
> upthread.  Code for that is available under a 2-clause BSD license.

Yea, that'd have been one of my suggestions too. Especially as I still
want to implement better compression using lz4, and that'll depend on
xxhash in turn.


> PostgreSQL page checksums use an algorithm based on, but not exactly,
> FNV-1a.  See storage/checksum_impl.h.  The comments there say this
> algorithm was chosen with speed in mind.  Our version is not
> endian-independent because it folds in 4-byte integers rather than
> 1-byte integers, but plain old FNV-1a *is* endian-independent and
> could be used.

Non-SIMDed (which we hope to achieve with our implementation, which is
why we have separate compiler flags for that file) implementations of
FNV are, to my knowledge, not particularly fast. And the SIMD tricks
are, to my knowledge, not really applicable to the case at hand here. So
I'm not a fan of choosing FNV.


> We also have an implementation of CRC32C in core - see port/pg_crc32.h
> and port/pg_crc32c_sb8.c.  It's not clear to me whether this is
> Endian-independent or not, although there is stuff that depends on
> WORDS_BIGENDIAN, so, uh, maybe?

The results should be endian independent. It depends on WORDS_BIGENDIAN
because we need different pre-computed tables depending on endianness.


Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 3, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> Do you have any feeling for which of those endianness-independent hash
>> functions might be a reasonable choice for us?
>
> Not a strong / very informed one, TBH.
>
> I'm not convinced it's worth trying to achieve this in the first place,
> now that we "nearly" have the insert-via-parent feature. Isn't this a
> lot of work, for very little practical gain? Having to select that when
> switching architectures still seems unproblematic. People will just
> about never switch endianess, so even a tiny performance & effort
> overhead doesn't seem worth it to me.

I kind of agree with you.  There are some advantages of being
endian-independent, like maybe your hash partitioning is really across
multiple shards, and not all the shards are the same machine
architecture, but it's not going to come up for most people.

For me, the basic point here is that we need a set of hash functions
for hash partitioning that are different than what we use for hash
indexes and hash joins -- otherwise when we hash partition a table and
create hash indexes on each partition, those indexes will have nasty
clustering.  Partitionwise hash joins will have similar problems.  So,
a new set of hash functions specifically for hash partitioning is
quite desirable.
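
A toy standalone demonstration of that clustering effect (illustrative only):
if rows were routed to a partition as hash(key) % nparts, then a hash index
inside one partition that buckets rows by the same hash value can only ever
populate nbuckets / nparts of its buckets:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const uint32_t nparts = 8;
    const uint32_t nbuckets = 64;
    int         used[64] = {0};

    /* Treat h as hash(key) and look only at keys routed to partition 3. */
    for (uint32_t h = 0; h < 100000; h++)
    {
        if (h % nparts == 3)
            used[h % nbuckets] = 1;     /* index bucket chosen with same hash */
    }

    int         count = 0;

    for (int i = 0; i < 64; i++)
        count += used[i];

    /* Prints "index buckets populated: 8 of 64". */
    printf("index buckets populated: %d of %u\n", count, nbuckets);
    return 0;
}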

Given that, if it's not a big problem to pick ones that have the
portability properties at least some people want, I'd be inclined to
do it.  If it results in a noticeable slowdown on macrobenchmarks,
then not so much, but otherwise, I'd rather do what people are asking
for than spend time arguing about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
Hi,

On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> For me, the basic point here is that we need a set of hash functions
> for hash partitioning that are different than what we use for hash
> indexes and hash joins -- otherwise when we hash partition a table and
> create hash indexes on each partition, those indexes will have nasty
> clustering.  Partitionwise hash joins will have similar problems.  So,
> a new set of hash functions specifically for hash partitioning is
> quite desirable.

Couldn't that just as well be solved by being a bit smarter with an IV? I
doubt we want to end up with different hashfunctions for sharding,
partitioning, hashjoins (which seems to form a hierarchy). Having a
working hash-combine function, or even better a hash API that can
continue to use the hash's internal state, seems a more scalable
solution.

Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
>> For me, the basic point here is that we need a set of hash functions
>> for hash partitioning that are different than what we use for hash
>> indexes and hash joins -- otherwise when we hash partition a table and
>> create hash indexes on each partition, those indexes will have nasty
>> clustering.  Partitionwise hash joins will have similar problems.  So,
>> a new set of hash functions specifically for hash partitioning is
>> quite desirable.
>
> Couldn't that just as well be solved by being a bit smarter with an IV? I
> doubt we want to end up with different hashfunctions for sharding,
> partitioning, hashjoins (which seems to form a hierarchy). Having a
> working hash-combine function, or even better a hash API that can
> continue to use the hash's internal state, seems a more scalable
> solution.

That's another way to go, but it requires inventing a way to thread
the IV through the hash opclass interface.  That's actually sort of a
problem anyway.  Maybe I ought to have started with the question of
how we're going to make that end of things work.  We could:

- Invent a new hash_partition AM that doesn't really make indexes but
supplies hash functions for hash partitioning.
- Add a new, optional support function 2 to the hash AM that takes a
value of the type *and* an IV as an argument.
- Something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Andres Freund
Date:
On 2017-08-03 17:57:37 -0400, Robert Haas wrote:
> On Thu, Aug 3, 2017 at 5:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-08-03 17:43:44 -0400, Robert Haas wrote:
> >> For me, the basic point here is that we need a set of hash functions
> >> for hash partitioning that are different than what we use for hash
> >> indexes and hash joins -- otherwise when we hash partition a table and
> >> create hash indexes on each partition, those indexes will have nasty
> >> clustering.  Partitionwise hash joins will have similar problems.  So,
> >> a new set of hash functions specifically for hash partitioning is
> >> quite desirable.
> >
> > Couldn't that just as well be solved by being a bit smarter with an IV? I
> > doubt we want to end up with different hashfunctions for sharding,
> > partitioning, hashjoins (which seems to form a hierarchy). Having a
> > working hash-combine function, or even better a hash API that can
> > continue to use the hash's internal state, seems a more scalable
> > solution.
> 
> That's another way to go, but it requires inventing a way to thread
> the IV through the hash opclass interface.

Only if we really want to do it really well :P. Using a hash_combine()
like

/*
 * Combine two hash values, resulting in another hash value, with decent bit
 * mixing.
 *
 * Similar to boost's hash_combine().
 */
static inline uint32
hash_combine(uint32 a, uint32 b)
{
        a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
        return a;
}

between hash(IV) and the hashfunction should do the trick (the IV needs
to be hashed once, otherwise the bit mix is bad).
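
As an illustrative aside (not from any posted patch), a partitioning-specific
hash could then be derived from a type's existing hash function roughly like
this, where hashed_iv is assumed to be the IV already run through a hash once
as noted above:

static inline uint32
partition_index(uint32 hashed_iv, uint32 datum_hash, uint32 nparts)
{
    /* combine the pre-hashed IV with the type's existing hash output */
    return hash_combine(hashed_iv, datum_hash) % nparts;
}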


> That's actually sort of a
> problem anyway.  Maybe I ought to have started with the question of
> how we're going to make that end of things work.

+1 one for that plan.


> We could:
> 
> - Invent a new hash_partition AM that doesn't really make indexes but
> supplies hash functions for hash partitioning.
> - Add a new, optional support function 2 to the hash AM that takes a
> value of the type *and* an IV as an argument.
> - Something else.

Not arguing for it, but one option could also have pg_type.hash*
function(s).

One thing that I think might be advisable to think about is that we're
atm stuck with a relatively bad hash function for hash indexes (and hash
joins/aggs), and we should probably evolve it at some point. At the same
time there's currently people out there relying on the current hash
functions remaining stable.

Greetings,

Andres Freund



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 3, 2017 at 6:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> That's another way to go, but it requires inventing a way to thread
>> the IV through the hash opclass interface.
>
> Only if we really want to do it really well :P. Using a hash_combine()
> like
>
> /*
>  * Combine two hash values, resulting in another hash value, with decent bit
>  * mixing.
>  *
>  * Similar to boost's hash_combine().
>  */
> static inline uint32
> hash_combine(uint32 a, uint32 b)
> {
>         a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
>         return a;
> }
>
> between hash(IV) and the hashfunction should do the trick (the IV needs
> to be hashed once, otherwise the bit mix is bad).

That seems pretty lame, although it's sufficient to solve the
immediate problem, and I have to admit to a certain predilection for
things that solve the immediate problem without creating lots of
additional work.

>> We could:
>>
>> - Invent a new hash_partition AM that doesn't really make indexes but
>> supplies hash functions for hash partitioning.
>> - Add a new, optional support function 2 to the hash AM that takes a
>> value of the type *and* an IV as an argument.
>> - Something else.
>
> Not arguing for it, but one option could also have pg_type.hash*
> function(s).

True.  That is a bit less configurable because you can't then have
multiple functions for the same type.  Going through the opclass
interface means you can have hash_portable_ops and hash_fast_ops and
let people choose.  But this would be easy to implement and enough for
most people in practice.

> One thing that I think might be advisable to think about is that we're
> atm stuck with a relatively bad hash function for hash indexes (and hash
> joins/aggs), and we should probably evolve it at some point. At the same
> time there's currently people out there relying on the current hash
> functions remaining stable.

That to me is a bit of a separate problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 3, 2017 at 6:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That seems pretty lame, although it's sufficient to solve the
> immediate problem, and I have to admit to a certain predilection for
> things that solve the immediate problem without creating lots of
> additional work.

After some further thought, I propose the following approach to the
issues raised on this thread:

1. Allow hash functions to have a second, optional support function,
similar to what we did for btree opclasses in
c6e3ac11b60ac4a8942ab964252d51c1c0bd8845.  The second function will
have a signature of (opclass_datatype, int64) and should return int64.
The int64 argument is a salt.  When the salt is 0, the low 32 bits of
the return value should match what the existing hash support function
returns.  Otherwise, the salt should be used to perturb the hash
calculation.  This design kills two birds with one stone: it gives
callers a way to get 64-bit hash values if they want them (which
should make Tom happy, and we could later think about plugging it into
hash indexes) and it gives us a way of turning a single hash function
into many (which should allow us to prevent hash indexes or hash
tables built on a hash-partitioned table from having a heavily
lopsided distribution, and probably will also make people who are
interested in topics like Bloom filters happy).
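
As a rough illustration of that contract (a sketch only; hash_uint32_extended
is a hypothetical seeded variant of the existing hash_uint32 whose low 32 bits
match it when the seed is 0, not a function that exists today), the int4
support function might look like:

Datum
hashint4extended(PG_FUNCTION_ARGS)
{
    int32       value = PG_GETARG_INT32(0);
    uint64      seed = PG_GETARG_INT64(1);

    /*
     * Assumed behavior: the seed perturbs the hash's initial state, and a
     * seed of 0 degenerates to the existing 32-bit hash in the low bits.
     */
    return hash_uint32_extended((uint32) value, seed);
}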

2. Introduce new hash opfamilies here which are faster, more
portable, and/or better in other ways than the ones we have today.
Given our current rather simplistic notion of a "default" opclass,
there doesn't seem to be an easy way to make whatever we introduce here
the default for hash partitioning while keeping the existing default
for other purposes.  That should probably be fixed at some point.
However, given the amount of debate this topic has generated, it also
doesn't seem likely that we'd actually wish to decide on a different
default in the v11 release cycle, so I don't think there's any rush to
figure out exactly how we want to fix it.  Focusing on introducing the
new opfamilies at all is probably a better use of time, IMHO.

Unless anybody strongly objects, I'm going to write a patch for #1 (or
convince somebody else to do it) and leave #2 for someone else to
tackle if they wish.  In addition, I'll tackle (or convince someone
else to tackle) the project of adding that second optional support
function to every hash opclass in the core repository.  Then Amul can
update the core hash partitioning patch to use the new infrastructure
when it's available and fall back to the existing method when it's
not, and I think we'll be in reasonably good shape.

Objections to this plan of attack?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> After some further thought, I propose the following approach to the
> issues raised on this thread:

> 1. Allow hash functions to have a second, optional support function,
> similar to what we did for btree opclasses in
> c6e3ac11b60ac4a8942ab964252d51c1c0bd8845.  The second function will
> have a signature of (opclass_datatype, int64) and should return int64.
> The int64 argument is a salt.  When the salt is 0, the low 32 bits of
> the return value should match what the existing hash support function
> returns.  Otherwise, the salt should be used to perturb the hash
> calculation.

+1

> 2. Introduce new hash opfamilies here which are faster, more
> portable, and/or better in other ways than the ones we have today.

This part seems, uh, under-defined and/or over-ambitious and/or unrelated
to the problem at hand.  What are the concrete goals?
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Wed, Aug 16, 2017 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> After some further thought, I propose the following approach to the
>> issues raised on this thread:
>
>> 1. Allow hash functions to have a second, optional support function,
>> similar to what we did for btree opclasses in
>> c6e3ac11b60ac4a8942ab964252d51c1c0bd8845.  The second function will
>> have a signature of (opclass_datatype, int64) and should return int64.
>> The int64 argument is a salt.  When the salt is 0, the low 32 bits of
>> the return value should match what the existing hash support function
>> returns.  Otherwise, the salt should be used to perturb the hash
>> calculation.
>
> +1

Cool.

>> 2. Introduce new hash opfamilies here which are faster, more
>> portable, and/or better in other ways than the ones we have today.
>
> This part seems, uh, under-defined and/or over-ambitious and/or unrelated
> to the problem at hand.  What are the concrete goals?

In my view, it's for the person who proposed a new opclass to say what
goals they're trying to satisfy with that opclass.  A variety of goals
that could justify a new opclass have been proposed upthread --
especially in Jeff Davis's original post (q.v.).  Such goals could
include endian-ness independence, collation-independence, speed,
better bit-mixing, and opfamilies that span more types than our
current ones.  These goals are probably mutually exclusive, in the
sense that endian-ness independence and collation-independence are
probably pulling in the exact opposite direction as speed, so
conceivably there could be multiple opclasses proposed with different
trade-offs.  I take no position for the present on which of those
would be worth accepting into core.

I agree with you that all of this is basically unrelated to the
problem at hand, if "the problem at hand" means "hash partitioning".
In my mind, there are two really serious issues that have been raised
on that front.  One is the problem of hash joins/aggregates/indexes on
hash-partitioned tables coming out lopsided, and adding an optional
salt will let us fix that problem.  The other is that if you migrate
your data to a different encoding or endianness, you might have
dump/reload problems, but IMHO the already-committed patch for
--load-via-partition-root is as much as really *needs* to be done
there.  I am less skeptical about the idea of endianness-independent
hash functions than he is, but "we can't have
$FEATURE_MANY_PEOPLE_WANT until we solve
$PROBLEM_ANDRES_THINKS_IS_NOT_PRACTICALLY_SOLVABLE" is not a route to
swift victory.

In short, I'm proposing to add a method to seed the existing hash
functions and get 64 bits out of them, and leaving any other potential
improvements to someone who wants to argue for them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Wed, Aug 16, 2017 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> After some further thought, I propose the following approach to the
>> issues raised on this thread:
>
>> 1. Allow hash functions to have a second, optional support function,
>> similar to what we did for btree opclasses in
>> c6e3ac11b60ac4a8942ab964252d51c1c0bd8845.  The second function will
>> have a signature of (opclass_datatype, int64) and should return int64.
>> The int64 argument is a salt.  When the salt is 0, the low 32 bits of
>> the return value should match what the existing hash support function
>> returns.  Otherwise, the salt should be used to perturb the hash
>> calculation.
>
> +1

Attached is a quick sketch of how this could perhaps be done (ignoring
for the moment the relatively-boring opclass pushups).  It introduces
a new function hash_any_extended which differs from hash_any() in that
(a) it combines both b and c into the result and (b) it accepts a seed
which is mixed into the initial state if it's non-zero.

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Attached is a quick sketch of how this could perhaps be done (ignoring
> for the moment the relatively-boring opclass pushups).  It introduces
> a new function hash_any_extended which differs from hash_any() in that
> (a) it combines both b and c into the result and (b) it accepts a seed
> which is mixed into the initial state if it's non-zero.

> Comments?

Hm.  Despite the comment at lines 302-304, I'm not sure that we ought
to do this simply by using "b" as the high order bits.  AFAICS that
exposes little or no additional randomness; in particular it seems
unlikely to meet Jenkins' original design goal that "every 1-bit and
2-bit delta achieves avalanche".  There might be some simple way to
extend the existing code to produce a mostly-independent set of 32 more
bits, but I wonder if we wouldn't be better advised to just keep Jenkins'
code as-is and use some other method entirely for producing the
32 new result bits.

... In fact, on perusing the linked-to page
http://burtleburtle.net/bob/hash/doobs.html
Bob says specifically that taking b and c from this hash does not
produce a fully random 64-bit result.  He has a new one that does,
lookup3.c, but probably he hasn't tried to make that bit-compatible
with the 1997 algorithm.

Your injection of the seed as prepended data seems unassailable from the
randomness standpoint.  But I wonder whether we could do it more cheaply
by xoring the seed into the initial a/b/c values --- it's not very clear
whether those are magic in any interesting way, or just some randomly
chosen constants.

Anyway, I'd certainly suggest that whoever embarks on this for real
spend some time perusing Jenkins' website.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Kenneth Marshall
Date:
On Wed, Aug 16, 2017 at 05:58:41PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Attached is a quick sketch of how this could perhaps be done (ignoring
> > for the moment the relatively-boring opclass pushups).  It introduces
> > a new function hash_any_extended which differs from hash_any() in that
> > (a) it combines both b and c into the result and (b) it accepts a seed
> > which is mixed into the initial state if it's non-zero.
> 
> > Comments?
> 
> Hm.  Despite the comment at lines 302-304, I'm not sure that we ought
> to do this simply by using "b" as the high order bits.  AFAICS that
> exposes little or no additional randomness; in particular it seems
> unlikely to meet Jenkins' original design goal that "every 1-bit and
> 2-bit delta achieves avalanche".  There might be some simple way to
> extend the existing code to produce a mostly-independent set of 32 more
> bits, but I wonder if we wouldn't be better advised to just keep Jenkins'
> code as-is and use some other method entirely for producing the
> 32 new result bits.
> 
> ... In fact, on perusing the linked-to page
> http://burtleburtle.net/bob/hash/doobs.html
> Bob says specifically that taking b and c from this hash does not
> produce a fully random 64-bit result.  He has a new one that does,
> lookup3.c, but probably he hasn't tried to make that bit-compatible
> with the 1997 algorithm.
> 

Hi,

The updated hash functions that we currently use are based on Bob Jenkins
lookup3.c and using b as the higher order bits is pretty darn good. I had
lobbied to present the 64-bit b+c hash in the original work for similar
reasons. We are definitely not using a lookup2.c version from 1997.

Regards,
Ken



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Kenneth Marshall <ktm@rice.edu> writes:
> On Wed, Aug 16, 2017 at 05:58:41PM -0400, Tom Lane wrote:
>> ... In fact, on perusing the linked-to page
>> http://burtleburtle.net/bob/hash/doobs.html
>> Bob says specifically that taking b and c from this hash does not
>> produce a fully random 64-bit result.  He has a new one that does,
>> lookup3.c, but probably he hasn't tried to make that bit-compatible
>> with the 1997 algorithm.

> The updated hash functions that we currently use are based on Bob Jenkins
> lookup3.c and using b as the higher order bits is pretty darn good. I had
> lobbied to present the 64-bit b+c hash in the original work for similar
> reasons. We are definitely not using a lookup2.c version from 1997.

Oh --- I overlooked the bit about "Bob's 2006 update".  Really that
comment block should have been completely rewritten, rather than leaving
the original text there, especially since as it stands there are only
pointers to the old algorithm.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Wed, Aug 16, 2017 at 5:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Attached is a quick sketch of how this could perhaps be done (ignoring
> for the moment the relatively-boring opclass pushups).

Here it is with some relatively-boring opclass pushups added.  I just
did the int4 bit; the same thing will need to be done, uh, 35 more
times.  But you get the gist.  No, not that kind of gist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Attachment

Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Fri, Aug 18, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 16, 2017 at 5:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Attached is a quick sketch of how this could perhaps be done (ignoring
> for the moment the relatively-boring opclass pushups).

Here it is with some relatively-boring opclass pushups added.  I just
did the int4 bit; the same thing will need to be done, uh, 35 more
times.  But you get the gist.  No, not that kind of gist.

I will work on this.

I have a small query: what if I want a cache entry with the extended hash function instead of the standard one?  I might require that while adding the hash_array_extended function.  Do you think we need to extend lookup_type_cache() as well?

Regards,
Amul

Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Fri, Aug 18, 2017 at 1:12 PM, amul sul <sulamul@gmail.com> wrote:
> I have a small query,  what if I want a cache entry with extended hash
> function instead standard one, I might require that while adding
> hash_array_extended function? Do you think we need to extend
> lookup_type_cache() as well?

Hmm, I thought you had changed the hash partitioning stuff so that it
didn't rely on lookup_type_cache().  You have to look up the function
using the opclass provided in the partition key definition;
lookup_type_cache() will give you the default one for the datatype.
Maybe just use get_opfamily_proc?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 18, 2017 at 1:12 PM, amul sul <sulamul@gmail.com> wrote:
> I have a small query,  what if I want a cache entry with extended hash
> function instead standard one, I might require that while adding
> hash_array_extended function? Do you think we need to extend
> lookup_type_cache() as well?

Hmm, I thought you had changed the hash partitioning stuff so that it
didn't rely on lookup_type_cache().  You have to look up the function
using the opclass provided in the partition key definition;
lookup_type_cache() will give you the default one for the datatype.
Maybe just use get_opfamily_proc?


Yes, we can do that for the partitioning code, but my concern is a little bit
different.  I apologize, I wasn't clear enough.

I am trying to extend hash_array & hash_range function. The hash_array and
hash_range function calculates hash by using the respective hash function for
the given argument type (i.e. array/range element type), and those hash
functions are made available in the TypeCacheEntry via lookup_type_cache(). But
in the hash_array & hash_range extended version requires a respective extended
hash function for those element type.

I have added hash_array_extended & hash_range_extended function in the attached
patch 0001, which maintains a local copy of TypeCacheEntry with extended hash
functions. But I am a little bit skeptical about this logic; any
advice/suggestions will be greatly appreciated.

The logic in the rest of the extended hash functions is the same as the standard
one. 

Attaching patch 0002 for the reviewer's testing.

Regards,
Amul
Attachment

Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Tue, Aug 22, 2017 at 5:44 PM, amul sul <sulamul@gmail.com> wrote:
On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Aug 18, 2017 at 1:12 PM, amul sul <sulamul@gmail.com> wrote:
> I have a small query,  what if I want a cache entry with extended hash
> function instead standard one, I might require that while adding
> hash_array_extended function? Do you think we need to extend
> lookup_type_cache() as well?

Hmm, I thought you had changed the hash partitioning stuff so that it
didn't rely on lookup_type_cache().  You have to look up the function
using the opclass provided in the partition key definition;
lookup_type_cache() will give you the default one for the datatype.
Maybe just use get_opfamily_proc?


Yes, we can do that for the partitioning code, but my concern is a little bit
different.  I apologize, I wasn't clear enough.

I am trying to extend hash_array & hash_range function. The hash_array and
hash_range function calculates hash by using the respective hash function for
the given argument type (i.e. array/range element type), and those hash
functions are made available in the TypeCacheEntry via lookup_type_cache(). But
in the hash_array & hash_range extended version requires a respective extended
hash function for those element type.

I have added hash_array_extended & hash_range_extended function in the attached
patch 0001, which maintains a local copy of TypeCacheEntry with extended hash
functions. But I am a little bit skeptical about this logic; any
advice/suggestions will be greatly appreciated.


Instead, in the attached patch, I have modified lookup_type_cache() so that
it can be asked to supply the extended hash function in the TypeCacheEntry.

For that, I've introduced new flags -- TYPECACHE_HASH_EXTENDED_PROC, TYPECACHE_HASH_EXTENDED_PROC_FINFO & TCFLAGS_CHECKED_HASH_EXTENDED_PROC -- and additional variables in the TypeCacheEntry structure to hold the extended hash proc information.
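
Roughly, hash_array_extended() could then ask the type cache for the element
type's extended hash function along these lines (a sketch following the flag
names above; the TypeCacheEntry field name here is an assumption):

    TypeCacheEntry *typentry;

    typentry = lookup_type_cache(element_type,
                                 TYPECACHE_HASH_EXTENDED_PROC_FINFO);
    if (!OidIsValid(typentry->hash_extended_proc_finfo.fn_oid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("could not identify an extended hash function for type %s",
                        format_type_be(element_type))));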
 
The logic in the rest of the extended hash functions is the same as the standard
one. 

Same for the hash_array_extended() & hash_range_extended() function as well.

Regards,
Amul

Attachment

Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Tue, Aug 22, 2017 at 8:14 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching patch 0002 for the reviewer's testing.

I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms.  However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextended(x, 1) comes out
different as to the low-order 32 bits.  The easiest way to do this
seems to be to cast to bit(32).  For example:

SELECT v, hashint4(v)::bit(32) as standard,
       hashint4extended(v, 0)::bit(32) as extended0,
       hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type.

From your latest version of 0001, I get:

rangetypes.c:1297:8: error: unused variable 'rngtypid'
      [-Werror,-Wunused-variable]
        Oid                     rngtypid = RangeTypeGetOid(r);

I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8
*/ when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).

hash_aclitem seems embarrassingly stupid.  I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.
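
Roughly, something like the following untested sketch -- the argument-fetching
macros and field names are assumed to mirror the existing hash_aclitem, and
this is only an illustration of the suggestion, not a committed form:

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/hash.h"
    #include "utils/acl.h"

    Datum
    hash_aclitem_extended(PG_FUNCTION_ARGS)
    {
        AclItem    *a = PG_GETARG_ACLITEM_P(0);
        uint64      seed = PG_GETARG_INT64(1);
        uint32      sum = (uint32) (a->ai_privs + a->ai_grantee + a->ai_grantor);

        /* With a zero seed, stay bit-compatible with hash_aclitem's result. */
        if (seed == 0)
            return UInt64GetDatum((uint64) sum);

        /* Otherwise, actually mix the sum with the seed. */
        return hash_any_extended((unsigned char *) &sum, sizeof(uint32), seed);
    }
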
          * Reset info about hash function whenever we pick up new info about
          * equality operator.  This is so we can ensure that the hash function
          * matches the operator.
          */
         typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+        typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);

Adjust comment: "about hash function" -> "about hash functions", "hash
function matches" -> "hash functions match".

+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+                  keylen, uint64 seed);

Formatting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Tue, Aug 29, 2017 at 11:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 22, 2017 at 8:14 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching patch 0002 for the reviewer's testing.

I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms.  However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextended(x, 1) comes out
different as to the low-order 32 bits.  The easiest way to do this
seems to be to cast to bit(32).  For example:

SELECT v, hashint4(v)::bit(32) as standard,
        hashint4extended(v, 0)::bit(32) as extended0,
        hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type.

Thanks for the suggestion, I have updated the 0002 patch accordingly.
Using this I found some strange behaviours as follows:

1) standard and extended0 output for the jsonb_hash case is not the same.
2) standard and extended0 output for the hash_range case is not the same when
   the input is int4range(550274, 1550274); the other cases in the patch are
   fine.  This can be reproduced with other values as well, e.g.
   int8range(1570275, 208112489).

Will look into this tomorrow.

Another case which I want to discuss: the extended and standard versions of
the hashfloat4, hashfloat8 & hash_numeric functions will have the same output
for a zero value irrespective of the seed value. Do you think we need a fix
for this?


From your latest version of 0001, I get:

rangetypes.c:1297:8: error: unused variable 'rngtypid'
      [-Werror,-Wunused-variable]
        Oid                     rngtypid = RangeTypeGetOid(r);

Fixed in the attached version.
 
I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8
*/ when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).

Fixed in the attached version.
 
hash_aclitem seems embarrassingly stupid.  I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.

          * Reset info about hash function whenever we pick up new info about
          * equality operator.  This is so we can ensure that the hash function
          * matches the operator.
          */
         typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+        typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);

Adjust comment: "about hash function" -> "about hash functions", "hash
function matches" -> "hash functions match".

Fixed in the attached version.
 
+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+                  keylen, uint64 seed);

Formatting.

Fixed in the attached version.

Thanks!

Regards,
Amul

Attachment

Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Wed, Aug 30, 2017 at 10:43 AM, amul sul <sulamul@gmail.com> wrote:
> Thanks for the suggestion, I have updated the 0002 patch accordingly.
> Using this I found some strange behaviours as follows:
>
> 1) standard and extended0 output for the jsonb_hash case is not the same.
> 2) standard and extended0 output for the hash_range case is not the same
>    when the input is int4range(550274, 1550274); the other cases in the
>    patch are fine.  This can be reproduced with other values as well, e.g.
>    int8range(1570275, 208112489).
>
> Will look into this tomorrow.

Those sound like bugs in your patch.  Specifically:

+    /* Same approach as hash_range */
+    result = hash_uint32_extended((uint32) flags, seed);
+    result ^= lower_hash;
+    result = (result << 1) | (result >> 63);
+    result ^= upper_hash;

That doesn't give compatible results.  The easiest thing to do might
be to rotate the high 32 bits and the low 32 bits separately.
JsonbHashScalarValueExtended has the same problem.  Maybe create a
helper function that does something like this (untested):

((x << 1) & UINT64CONST(0xfffffffefffffffe)) | ((x >> 31) &
UINT64CONST(0x100000001))
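
Wrapped up as a helper, that could look roughly like the sketch below (again
untested; the function name is just a placeholder).  It rotates each 32-bit
half left by one bit independently, so the low half keeps matching what the
existing 32-bit code path produces:

    #include "postgres.h"

    /* Rotate the high and low 32-bit halves of a 64-bit hash left by one
     * bit, independently of each other. */
    static uint64
    rotate_high_and_low_32bits(uint64 x)
    {
        return ((x << 1) & UINT64CONST(0xfffffffefffffffe)) |
               ((x >> 31) & UINT64CONST(0x0000000100000001));
    }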

> Another case which I want to discuss: the extended and standard versions of
> the hashfloat4, hashfloat8 & hash_numeric functions will have the same
> output for a zero value irrespective of the seed value. Do you think we need
> a fix for this?

Yes, I think you should return the seed rather than 0 in the cases
where the current code hard-codes a 0 return.  That will give the same
behavior in the seed == 0 case while cheaply getting us something a
bit different when there is a seed.

BTW, you should probably invent and use a PG_RETURN_UINT64 macro in this patch.
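
Putting those two points together, the float8 case could end up looking
roughly like this sketch (illustrative only; PG_RETURN_UINT64 is the
hypothetical macro mentioned above, and hash_any_extended is the seeded
variant from the 0001 patch):

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/hash.h"

    Datum
    hashfloat8extended(PG_FUNCTION_ARGS)
    {
        float8      key = PG_GETARG_FLOAT8(0);
        uint64      seed = PG_GETARG_INT64(1);

        /* Same rationale as hashfloat8: +0 and -0 must hash alike, but
         * returning the seed instead of a hard-coded 0 lets a non-zero
         * seed still change the result. */
        if (key == (float8) 0)
            PG_RETURN_UINT64(seed);

        return hash_any_extended((unsigned char *) &key, sizeof(key), seed);
    }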

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Wed, Aug 30, 2017 at 9:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 30, 2017 at 10:43 AM, amul sul <sulamul@gmail.com> wrote:
> Thanks for the suggestion, I have updated the 0002 patch accordingly.
> Using this I found some strange behaviours as follows:
>
> 1) standard and extended0 output for the jsonb_hash case is not the same.
> 2) standard and extended0 output for the hash_range case is not the same
>    when the input is int4range(550274, 1550274); the other cases in the
>    patch are fine.  This can be reproduced with other values as well, e.g.
>    int8range(1570275, 208112489).
>
> Will look into this tomorrow.

Those sound like bugs in your patch.  Specifically:

+    /* Same approach as hash_range */
+    result = hash_uint32_extended((uint32) flags, seed);
+    result ^= lower_hash;
+    result = (result << 1) | (result >> 63);
+    result ^= upper_hash;

Yes, you are correct.
 

That doesn't give compatible results.  The easiest thing to do might
be to rotate the high 32 bits and the low 32 bits separately.
JsonbHashScalarValueExtended has the same problem.  Maybe create a
helper function that does something like this (untested):

((x << 1) & UINT64CONST(0xfffffffefffffffe)) | ((x >> 31) &
UINT64CONST(0x100000001))

This works as expected; I also tested it by executing the following SQL
multiple times:

SELECT v as value, hash_range(v)::bit(32) as standard,
       hash_range_extended(v, 0)::bit(32) as extended0,
       hash_range_extended(v, 1)::bit(32) as extended1
FROM   (VALUES (int8range(floor(random() * 100)::int8, floor(random() * 1000)::int8)),
        (int8range(floor(random() * 1000)::int8, floor(random() * 10000)::int8)),
        (int8range(floor(random() * 10000)::int8, floor(random() * 100000)::int8)),
        (int8range(floor(random() * 10000000)::int8, floor(random() * 100000000)::int8)),
        (int8range(floor(random() * 100000000)::int8, floor(random() * 1000000000)::int8))) x(v)
WHERE  hash_range(v)::bit(32) != hash_range_extended(v, 0)::bit(32)
       OR hash_range(v)::bit(32) = hash_range_extended(v, 1)::bit(32);

 
> Another case which I want to discuss: the extended and standard versions of
> the hashfloat4, hashfloat8 & hash_numeric functions will have the same
> output for a zero value irrespective of the seed value. Do you think we need
> a fix for this?

Yes, I think you should return the seed rather than 0 in the cases
where the current code hard-codes a 0 return.  That will give the same
behavior in the seed == 0 case while cheaply getting us something a
bit different when there is a seed.

Fixed in the attached version.

 
BTW, you should probably invent and use a PG_RETURN_UINT64 macro in this patch.

Added in the attached version.

Thanks for all the suggestions.

Regards,
Amul
Attachment

Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 31, 2017 at 8:40 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

I fixed these up a bit and committed them.  Thanks.

I think this takes care of adding not only the infrastructure but
support for all the core data types, but I'm not quite sure how to
handle upgrading types in contrib.  It looks like citext, hstore, and
several data types provided by isn have hash opclasses, and I think
that there's no syntax for adding a support function to an existing
opclass.  We could add that, but I'm not sure how safe it would be.

TBH, I really don't care much about fixing isn, but it seems like
fixing citext and hstore would be worthwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think this takes care of adding not only the infrastructure but
> support for all the core data types, but I'm not quite sure how to
> handle upgrading types in contrib.  It looks like citext, hstore, and
> several data types provided by isn have hash opclasses, and I think
> that there's no syntax for adding a support function to an existing
> opclass.  We could add that, but I'm not sure how safe it would be.

ALTER OPERATOR FAMILY ADD FUNCTION ... ?

That would result in the functions being considered "loose" in the
family rather than bound into an operator class.  I think that's
actually the right thing, because they shouldn't be considered
to be required.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
Robert Haas
Date:
On Thu, Aug 31, 2017 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think this takes care of adding not only the infrastructure but
>> support for all the core data types, but I'm not quite sure how to
>> handle upgrading types in contrib.  It looks like citext, hstore, and
>> several data types provided by isn have hash opclasses, and I think
>> that there's no syntax for adding a support function to an existing
>> opclass.  We could add that, but I'm not sure how safe it would be.
>
> ALTER OPERATOR FAMILY ADD FUNCTION ... ?
>
> That would result in the functions being considered "loose" in the
> family rather than bound into an operator class.  I think that's
> actually the right thing, because they shouldn't be considered
> to be required.

But wouldn't that result in a different effect than the core data type
changes I just did?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Hash Functions

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 31, 2017 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ALTER OPERATOR FAMILY ADD FUNCTION ... ?
>> 
>> That would result in the functions being considered "loose" in the
>> family rather than bound into an operator class.  I think that's
>> actually the right thing, because they shouldn't be considered
>> to be required.

> But wouldn't that result in a different effect than the core data type
> changes I just did?

Possibly --- I have not read that patch.  But considering that all core
functions are pinned anyway, it doesn't seem like it much matters whether
we consider them to be loosely or tightly bound to opclasses.  That
should only matter if one tries to drop the function.
        regards, tom lane



Re: [HACKERS] Hash Functions

From
amul sul
Date:
On Fri, Sep 1, 2017 at 8:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Aug 31, 2017 at 8:40 AM, amul sul <sulamul@gmail.com> wrote:
> Fixed in the attached version.

I fixed these up a bit and committed them.  Thanks.

I think this takes care of adding not only the infrastructure but
support for all the core data types, but I'm not quite sure how to
handle upgrading types in contrib.  It looks like citext, hstore, and
several data types provided by isn have hash opclasses, and I think
that there's no syntax for adding a support function to an existing
opclass.  We could add that, but I'm not sure how safe it would be.

TBH, I really don't care much about fixing isn, but it seems like
fixing citext and hstore would be worthwhile.

The attached patch proposes a fix for the citext and hstore contrib modules.

To make it easier to understand, I've split the patch into two parts.  0001
adds a new file for the contrib upgrade & renames an existing file to the
higher version, and 0002 is the actual implementation of the extended hash
function for that contrib's data type.

Regards,
Amul

Attachment