Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id CAAKRu_ZsRU+nszShs3AGVorx=e+2jYkL7X=jiNO6+qbho7vRpw@mail.gmail.com
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
So, I've rewritten the patch to track the match statuses of the outer
batch file's tuples in a BufFile: the bytes written to the file start
out as 0, and, upon encountering a match for a tuple, I set its bit in
the file to 1. (The patch is also rebased on current master.)
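
Roughly, that means one zeroed status byte per 8 outer tuples, written
out as the match-status file is initialized. A sketch, not the patch's
code verbatim (statusFile and outer_ntuples are made-up names; the
BufFile API is the existing one from storage/buffile.h):

/*
 * Sketch only: write one zeroed status byte per 8 outer tuples so that
 * every tuple's match bit starts out as 0.
 */
BufFile    *statusFile = BufFileCreateTemp(false);
unsigned char zerobyte = 0;

for (int64 tupleno = 0; tupleno < outer_ntuples; tupleno++)
{
    /* start a fresh status byte for each group of 8 outer tuples */
    if (tupleno % 8 == 0)
        BufFileWrite(statusFile, &zerobyte, 1);
}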

It, of course, only works for parallel-oblivious hashjoin -- it relies
on the deterministic order in which tuples are encountered in the
outer-side batch file to set the right match bit, and it uses a counter
to decide which bit to set.
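
The counter-to-bit mapping then looks roughly like this (again only a
sketch with made-up names; tupleno is the counter of tuples read back,
in deterministic order, from the outer batch file so far):

/*
 * Sketch of setting the match bit for the outer tuple at position
 * tupleno in the batch file.
 */
off_t       byteno = tupleno / 8;
int         bitno = tupleno % 8;
unsigned char statusbyte;

/* read-modify-write the status byte holding this tuple's match bit */
if (BufFileSeek(statusFile, 0, byteno, SEEK_SET) != 0)
    elog(ERROR, "could not seek in match status file");
if (BufFileRead(statusFile, &statusbyte, 1) != 1)
    elog(ERROR, "could not read match status byte");
statusbyte |= (unsigned char) (1 << bitno);
if (BufFileSeek(statusFile, 0, byteno, SEEK_SET) != 0)
    elog(ERROR, "could not seek in match status file");
BufFileWrite(statusFile, &statusbyte, 1);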

I did the "needlessly dumb implementation" Robert mentioned, though I
did think about it and couldn't come up with a much smarter way to
write match bits to a file. I think there might be an optimization
opportunity in not writing the current_byte to the file each time the
outer tuple matches, and only doing so once we have advanced to a tuple
number whose match bit would not be in the current_byte. I didn't do
that, to keep it simple, and I suspect there might be a bit of
gymnastics needed to make sure that byte is actually written to the
file in case we exit from some other state before we encounter the
tuple represented by the last bit in that byte.
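
Very roughly, that optimization would look something like the following
(not implemented in the attached patch; current_byte, current_byteno,
and tuple_matched are made-up names):

/*
 * Buffer the bits for the current group of 8 tuples in memory and only
 * write the byte once the scan has advanced past that group.
 */
if (tupleno / 8 != current_byteno)
{
    /* flush the finished byte before moving on to the next group */
    BufFileSeek(statusFile, 0, current_byteno, SEEK_SET);
    BufFileWrite(statusFile, &current_byte, 1);
    current_byteno = tupleno / 8;
    current_byte = 0;
}

if (tuple_matched)
    current_byte |= (unsigned char) (1 << (tupleno % 8));

/*
 * The gymnastics mentioned above: any early exit to another state would
 * also need to flush the last current_byte to the file.
 */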

I plan to work on a separate implementation for parallel hashjoin
next--to understand what is required. I believe the logic to decide
when to fall back should be fairly easy to slot in at the end once
we've decided what that logic is.

On Sat, Jul 13, 2019 at 4:44 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Jul 03, 2019 at 02:22:09PM -0700, Melanie Plageman wrote:
>On Tue, Jun 18, 2019 at 3:24 PM Melanie Plageman <melanieplageman@gmail.com>
>
>Before doing that, I wanted to ask what a desirable fallback condition
>would be. In this patch, fallback to hashloop join happens only when,
>while inserting tuples into the hashtable for a batch after batch 0,
>inserting another tuple from the batch file would exceed work_mem. This
>means you can't increase nbatches, which, I would think, is undesirable.
>

Yes, I think that's undesirable.

>I thought a bit about when fallback should happen. So, let's say that
>we would like to fall back to hashloop join once we have increased
>nbatches X times. At that point, since we do not want to fall back to
>hashloop join for all batches, we have to make a decision. After
>increasing nbatches the Xth time, do we then fall back for all batches
>for which inserting inner tuples exceeds work_mem? Do we use this
>strategy but with work_mem plus some fudge factor?
>
>Or, do we instead try to determine whether data skew led us to increase
>nbatches each time and then determine which batch, given the new
>nbatches, contains that data, set fallback to true only for that
>batch, and let all other batches use the existing logic (with no
>fallback option) unless they contain a value which leads to increasing
>nbatches X number of times?
>

I think we should try to detect the skew and use this hashloop logic
only for the one batch. That's based on the assumption that the hashloop
is less efficient than the regular hashjoin.

We may need to apply it even for some non-skewed (but misestimated)
cases, though. At some point we'd need more than work_mem for BufFiles,
at which point we ought to use this hashloop.


I have not yet changed my original design's logic for deciding when to
fall back. A given batch still falls back only if that batch's inner
batch file doesn't fit in memory; I haven't yet changed the logic to
allow increasing the number of batches some number of times, or
according to some other criteria, before falling back for that batch.
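
For reference, the shape of that per-batch test is roughly as follows
(a sketch rather than the patch verbatim; hashloop_fallback, tupleSize,
curbatch, slot, and hashvalue are made-up or contextual names):

/*
 * If adding another inner tuple would exceed the work_mem budget, mark
 * just this batch to use the hashloop strategy instead of doubling
 * nbatch; otherwise keep filling the hashtable as usual.
 */
if (hashtable->spaceUsed + tupleSize > hashtable->spaceAllowed)
    hashloop_fallback[curbatch] = true;
else
    ExecHashTableInsert(hashtable, slot, hashvalue);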

--
Melanie Plageman
