Re: Out of Memory errors are frustrating as heck! - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Out of Memory errors are frustrating as heck!
Date
Msg-id 20190422130415.3xv3p2izpyh7xhad@development
Whole thread Raw
In response to Re: Out of Memory errors are frustrating as heck!  (Gaetano Mendola <mendola@gmail.com>)
List pgsql-performance
On Mon, Apr 22, 2019 at 10:07:52AM +0200, Gaetano Mendola wrote:
>   Batch splitting shouldn't be followed by a hash function change? 

What would be the value? That can help with hash collisions, but that's
not the issue with the data sets discussed in this thread. The issue
reported originally is about underestimates, and the sample data set has
a large number of duplicate values (a single value representing ~10% of
the data set). Neither of those issues is about hash collisions.

The data set I used to demonstrate how the algorithms work is pretty
perfect, with uniform distribution and no hash collisions.

Furthermore, I don't think we can just change the hash function, for a
couple of technical reasons.

Firstly, it's not like we totally redistribute the whole dataset from N
old batches to (2*N) new ones. By using the same 32-bit hash value and
cconsidering one extra bit, the tuples either stay in the same batch
(when the new bit is 0) or move to a single new batch (when it's 1). So
each batch is split in 1/2. By changing the hash function this would no
longer be true, and we'd redistribute pretty much the whole data set.

The other issue is even more significant - we don't redistribute the
tuples immediately. We only redistribute the current batch, but leave
the other batches alone and handle them when we actually get to them.
This is possible, because the tuples never move backwards - when
splitting batch K, the tuples either stay in K or move to 2K. Or
something like that, I'm too lazy to recall the exact formula now.

And if I recall correctly, I think we can increment the number of
batches while already performing the join, after some rows were already
processed. That would probably be no longer true if we just switched the
hash function, because it might move rows backwards (to the already
processed region).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



pgsql-performance by date:

Previous
From: Gaetano Mendola
Date:
Subject: Re: Out of Memory errors are frustrating as heck!
Next
From: Jeff Janes
Date:
Subject: Re: Out of Memory errors are frustrating as heck!