Hi,
On 2018-01-29 13:45:10 -0500, Tom Lane wrote:
> Also, the fact that you needed one more ORDER BY in testing on a
> single machine is already proof of my worry about the regression
> tests, and it'll only get worse when testing across a variety
> of machines.
That's an entirely unconvincing argument. A lot of changes to hashtable
code will result in ordering changes, and has done so in the past. Sure
there occasionally was need for a followup ORDER BY addition, but that's
not that bad. The fact that only a single order by was needed when
entirely changing the hash function output (by randomizing it) doesn't
at all seem to suggest that it's a huge problem.
On 2018-01-29 14:11:26 -0500, Tom Lane wrote:
> One other point here is that it's not really clear to me what a randomly
> varying IV is supposed to accomplish. Surely we're not intending that
> it prevents somebody from crafting a data set that causes bad hash
> performance.
I do think we should make that harder.
> If a user with DB access wants to cause a performance
> problem, there are and always will be plenty of other avenues to making
> that happen.
Sure, but that's different from somebody triggering, e.g. by website
input, such behaviour.
> If the idea is that for a data set that otherwise would have
> bad hash performance, choosing a different IV would (almost always) fix
> it, that sounds good but you're ignoring the inverse case: for a data set
> that works fine, there would be some choices of IV that create a problem
> where there was none before. I see no reason to think that the
> probability of the former kind of situation is higher than the latter.
It's easily intentionally triggerable vs. not.
> So I'm on board with using the extended hash functions when available,
> but I'm not convinced that a varying IV buys us anything but trouble.
I think we really need both. A constant IV of e.g. 0 for a single-column
hashtable doesn't really fix much, even though it does improve hash
quality a bit for multi-column indexes. A random IV does address issues
due to a lot of tuples falling into the same bucket, but does not
address actual collisions. Both together are pretty convincing tho.
Greetings,
Andres Freund