Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date 2019-06-07
Msg-id 20190607141740.fgoduqszejf6ugma@development
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Avoiding hash join batch explosions with extreme skew and weird stats
List pgsql-hackers
On Thu, Jun 06, 2019 at 04:33:31PM -0700, Melanie Plageman wrote:
>On Tue, Jun 4, 2019 at 12:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Tue, Jun 4, 2019 at 3:09 PM Melanie Plageman
>> <melanieplageman@gmail.com> wrote:
>> > One question I have is, how would the OR'd together bitmap be
>> > propagated to workers after the first chunk? That is, when there are
>> > no tuples left in the outer bunch, for a given inner chunk, would you
>> > load the bitmaps from each worker into memory, OR them together, and
>> > then write the updated bitmap back out so that each worker starts with
>> > the updated bitmap?
>>
>> I was assuming we'd elect one participant to go read all the bitmaps,
>> OR them together, and generate all the required null-extended tuples,
>> sort of like the PHJ_BUILD_ALLOCATING, PHJ_GROW_BATCHES_ALLOCATING,
>> PHJ_GROW_BUCKETS_ALLOCATING, and/or PHJ_BATCH_ALLOCATING states only
>> involve one participant being active at a time. Now you could hope for
>> something better -- why not parallelize that work?  But on the other
>> hand, why not start simple and worry about that in some future patch
>> instead of right away? A committed patch that does something good is
>> better than an uncommitted patch that does something AWESOME.
>>
>>
>What if you have a lot of tuples -- couldn't the bitmaps get pretty
>big? And then you have to OR them all together and if you can't put
>the whole bitmap from each worker into memory at once to do it, it
>seems like it would be pretty slow. (I mean maybe not as slow as
>reading the outer side 5 times when you only have 3 chunks on the
>inner + all the extra writes from my unmatched tuple file idea, but
>still...)
>

Yes, they could get quite big, and I think you're right that we need to
keep that in mind, because the bitmap covers the outer (often quite large)
side of the join. And if we're aiming to restrict memory usage, it'd be
weird to just ignore this.
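(For a sense of scale: at one bit per tuple, a billion outer tuples
translates to a bitmap of roughly 120MB.)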

But I think Thomas Munro originally proposed to treat this as a separate
BufFile, so my assumption was that each worker would simply rewrite the
bitmap repeatedly, once for each hash table fragment. That means a bit more
I/O, but those files are buffered and written in 8kB pages, with just 1 bit
per tuple, so I think that's pretty OK and way cheaper than rewriting the
whole batch, where each tuple can be hundreds of bytes.
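
To make that concrete, here is a minimal standalone sketch of what the
per-worker bitmap update might look like (plain stdio and a made-up
function name, purely for illustration; the real thing would of course go
through the BufFile's 8kB page buffering instead):

#include <stdio.h>
#include <stdint.h>

/*
 * Set the "matched" bit for outer tuple number 'tupno' in a file-backed
 * bitmap, one bit per outer tuple of the current batch.  The file is
 * assumed to be opened with fopen(..., "w+b") once per batch.
 */
static void
bitmap_set_match(FILE *bitmap, uint64_t tupno)
{
	long		byteno = (long) (tupno / 8);
	int			bit = (int) (tupno % 8);
	int			c;
	uint8_t		byte;

	/* read the current byte (unwritten bytes count as all-zeroes) */
	fseek(bitmap, byteno, SEEK_SET);
	c = fgetc(bitmap);
	byte = (c == EOF) ? 0 : (uint8_t) c;

	/* set the bit and write the byte back */
	byte |= (uint8_t) (1 << bit);
	fseek(bitmap, byteno, SEEK_SET);
	fputc(byte, bitmap);
}

While probing one inner fragment the worker does this for every outer
tuple that finds a match, and the same file simply gets updated again for
the next fragment, so the per-fragment rewrite cost is bounded by the
bitmap size, not the batch size.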

Also, it does not require any concurrency control, which rewriting the
batches themselves probably does (because we'd be feeding the tuples into
some shared file, I suppose). Except for the final step when we need to
merge the bitmaps, of course.
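
And that final merge can itself be done with bounded memory. A rough
standalone sketch (again plain stdio, with names invented just for
illustration): one elected participant ORs the per-worker bitmap files
together one 8kB chunk at a time, and every bit that is still zero
identifies an outer tuple that needs to be null-extended:

#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE	8192		/* mirrors the 8kB BufFile page size */

/*
 * OR the per-worker match bitmaps together chunk by chunk and report the
 * outer tuples whose bit is still zero, i.e. the ones that never matched
 * in any worker and have to be emitted NULL-extended.
 */
static void
emit_unmatched(FILE **worker_bitmaps, int nworkers, uint64_t ntuples)
{
	uint8_t		merged[CHUNK_SIZE];
	uint8_t		buf[CHUNK_SIZE];
	uint64_t	tupno = 0;

	for (int w = 0; w < nworkers; w++)
		rewind(worker_bitmaps[w]);

	while (tupno < ntuples)
	{
		memset(merged, 0, sizeof(merged));

		/* OR the next chunk from every worker's bitmap */
		for (int w = 0; w < nworkers; w++)
		{
			size_t		nread = fread(buf, 1, CHUNK_SIZE, worker_bitmaps[w]);

			for (size_t i = 0; i < nread; i++)
				merged[i] |= buf[i];
		}

		/* any bit still zero means "no worker saw a match" */
		for (size_t i = 0; i < CHUNK_SIZE * 8 && tupno < ntuples; i++, tupno++)
		{
			if (!(merged[i / 8] & (1 << (i % 8))))
				printf("null-extend outer tuple %" PRIu64 "\n", tupno);
		}
	}
}

So the merge never needs more than two 8kB buffers in memory at a time,
no matter how large the outer side is.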

So I think this would work: it does not have the issue of using too much
memory, and I don't think the overhead is too bad.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



