Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Msg-id: CA+TgmoawdRfARS+DA-k_vm2TJgGBH+nUgR52Y5V_pWxvfejovw@mail.gmail.com
In response to: Re: bad estimation together with large work_mem generates terrible slow hash joins (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On Sat, Aug 9, 2014 at 9:13 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
> Adding the least-significant bit does not work; we need to get back to
> adding the most-significant one. Not sure what's the least complex way
> to do that, though.
>
> I'm thinking about computing the nbuckets limit (how many buckets may
> fit into work_mem):
>
>         tupsize = HJTUPLE_OVERHEAD +
>                 MAXALIGN(sizeof(MinimalTupleData)) +
>                 MAXALIGN(tupwidth);
>
>         nbuckets_bits_max = my_log2(work_mem / tupsize)
>
> and then start with nbatches from this bit, like this:
>
>  31               23     max       15                7                0
>   |----------------|----------------|----------------|----------------|
>   |      |   <- batches    |       free       | <-      buckets       |
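
(For concreteness, a minimal sketch of the quoted proposal, with invented
names; this is not PostgreSQL source. The bucket number comes from the
low-order bits, the batch number from the bits starting at the
work_mem-derived ceiling, and the middle bits stay free for the bucket
range to grow into.)

    #include <stdint.h>

    /* Illustrative only: split of the 32-bit hash under the proposal above. */
    static inline void
    get_bucket_and_batch(uint32_t hashvalue,
                         int log2_nbuckets,      /* current bucket bits */
                         int nbuckets_bits_max,  /* ceiling from work_mem */
                         int log2_nbatch,        /* current batch bits */
                         int *bucketno, int *batchno)
    {
        /* bucket number: low-order bits, free to widen up to the ceiling */
        *bucketno = (int) (hashvalue & ((1U << log2_nbuckets) - 1));

        /* batch number: anchored at the fixed ceiling, so it never moves */
        *batchno = (int) ((hashvalue >> nbuckets_bits_max)
                          & ((1U << log2_nbatch) - 1));
    }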

That doesn't seem unreasonable provided there's enough bit space, but,
as you point out, the day might not be far off when there isn't.

It also strikes me that when there's only 1 batch, the set of bits
that map onto the batch number is zero-width, and one zero-width bit
range is as good as another.  In other words, if you're only planning
to do one batch, you can easily grow the number of buckets on the fly.
Growing the number of buckets only becomes difficult once you have
more than one batch.
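
(A simplified sketch of that observation, with invented names; roughly
the shape of the bucket/batch computation, not the exact PostgreSQL
function:)

    #include <stdint.h>

    /* Sketch: nbuckets and nbatch are assumed to be powers of two. */
    static inline void
    bucket_and_batch(uint32_t hashvalue, uint32_t nbuckets, int log2_nbuckets,
                     uint32_t nbatch, int *bucketno, int *batchno)
    {
        *bucketno = (int) (hashvalue & (nbuckets - 1));

        if (nbatch > 1)
            *batchno = (int) ((hashvalue >> log2_nbuckets) & (nbatch - 1));
        else
            *batchno = 0;   /* zero-width batch field: no bits consumed */
    }

With nbatch == 1, nothing but the bucket mask looks at the hash value, so
doubling nbuckets just means re-masking the in-memory tuples; once nbatch
is greater than 1, the bucket width also decides which bits pick the
batch, which is where the trouble starts.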

And I mention that because, in the scenario mentioned in your original
post ("a hash table with small number of buckets (8192) containing
large number of tuples [~3.3M]"), presumably what happens is you start
out thinking you are going to have one batch with 8K buckets.  Then,
as more tuples continue to pour in, you see the load factor rising and
respond by bumping up the size of the hash table.  Now eventually
you realize, gosh, this isn't even going to fit into work_mem, so you
need to start batching it.  But by that point you've presumably done
as much as you're going to do in terms of increasing the number of
buckets; after that, you're just going to add more batches as
necessary.  So not being able to further increase the number of
buckets once the number of batches is no longer 1 wouldn't be a
problem in that case.
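
(A rough sketch of that sequence, with invented names and a deliberately
simplified state struct; this is neither the patch nor PostgreSQL source,
just the shape of the behaviour described above.)

    #include <stddef.h>

    /* Invented, simplified state; the real fields live elsewhere. */
    typedef struct
    {
        size_t  ntuples;            /* tuples loaded so far */
        size_t  nbuckets;           /* current buckets (power of two) */
        int     log2_nbuckets;
        size_t  nbatch;             /* current batches (power of two) */
        size_t  space_used;         /* bytes of tuples held in memory */
        size_t  space_allowed;      /* work_mem, in bytes */
        size_t  tuples_per_bucket;  /* target load factor */
    } HashSketch;

    /* Called as tuples pour in. */
    static void
    consider_growth(HashSketch *h)
    {
        /* While there is a single batch, buckets are cheap to double:
         * the batch number is zero-width, so tuples just get re-masked. */
        if (h->nbatch == 1 &&
            h->space_used <= h->space_allowed &&
            h->ntuples > h->nbuckets * h->tuples_per_bucket)
        {
            h->nbuckets *= 2;
            h->log2_nbuckets++;
        }

        /* Once work_mem is exceeded, only the batch count grows; the
         * bucket count is effectively frozen from this point on. */
        if (h->space_used > h->space_allowed)
            h->nbatch *= 2;     /* half the tuples spill to later batches */
    }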

Now if you start out planning for multiple batches, and then you
discover you'd like to keep the same number of batches but have more
buckets per batch, that's gonna be hard.  It's not impossible, because
you could steal some of the unused high-order bits above the number of
batches, and then concatenate them with the low-order bits that you
already had, but that seems like kind of an ugly kludge, and AFAICS it
only helps if the new tuples are narrower by enough to justify the
cost of splitting all the buckets.  I won't say that couldn't happen,
but I think it would be better to keep that complexity out of the
patch unless we're absolutely certain it's justified.
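
(For what it's worth, a sketch of what that bit-stealing kludge could
look like, with all names invented, purely to show the shape of it: the
batch number keeps its position just above the original bucket bits, and
the extra bucket bits come from the unused space above it, glued on top
of the old low-order bits.)

    #include <stdint.h>

    /* Hypothetical: widen the bucket number after batching has started by
     * stealing unused high-order hash bits above the batch field. */
    static inline int
    bucketno_with_stolen_bits(uint32_t hashvalue,
                              int log2_nbuckets_old, /* original bucket bits */
                              int log2_nbatch,       /* bits spent on batches */
                              int extra_bits)        /* stolen from up high */
    {
        uint32_t low  = hashvalue & ((1U << log2_nbuckets_old) - 1);
        uint32_t high = (hashvalue >> (log2_nbuckets_old + log2_nbatch))
                        & ((1U << extra_bits) - 1);

        /* The batch field sits untouched between the two pieces, so batch
         * assignments stay valid; every existing bucket still has to be
         * split 2^extra_bits ways, which is the expensive part. */
        return (int) ((high << log2_nbuckets_old) | low);
    }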

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


