Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id 20190604223111.4mxtoflbavckc63q@development
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Tue, Jun 04, 2019 at 03:08:24PM -0400, Robert Haas wrote:
>On Tue, Jun 4, 2019 at 2:47 PM Melanie Plageman
><melanieplageman@gmail.com> wrote:
>> Oops! You are totally right.
>> I will amend the idea:
>> For each chunk on the inner side, loop through both the original batch
>> file and the unmatched outer tuples file created for the last chunk.
>> Emit any matches and write out any unmatched tuples to a new unmatched
>> outer tuples file.
>>
>> I think, in the worst case, if no tuples from the outer have a match,
>> you end up writing out all of the outer tuples for each chunk on the
>> inner side. However, using the match bit in the tuple header solution
>> would require this much writing.
>> Probably the bigger problem is that in this worst case you would also
>> need to read double the number of outer tuples for each inner chunk.
>>
>> However, in the best case it seems like it would be better than the
>> match bit/write everything from the outer side out solution.
>
>I guess so, but the downside of needing to read twice as many outer
>tuples for each inner chunk seems pretty large.  It would be a lot
>nicer if we could find a way to store the matched-bits someplace other
>than where we are storing the tuples, what Thomas called a
>bits-in-order scheme, because then the amount of additional read and
>write I/O would be tiny -- one bit per tuple doesn't add up very fast.
>
>In the scheme you propose here, I think that after you read the
>original outer tuples for each chunk and the unmatched outer tuples
>for each chunk, you'll have to match up the unmatched tuples to the
>original tuples, probably by using memcmp() or something.  Otherwise,
>when a new match occurs, you won't know which tuple should now not be
>emitted into the new unmatched outer tuples file that you're going to
>produce.  So I think what's going to happen is that you'll read the
>original batch file, then read the unmatched tuples file and use that
>to set or not set a bit on each tuple in memory, then do the real work
>setting more bits, then write out a new unmatched-tuples file with the
>tuples that still don't have the bit set.  So your unmatched tuple
>file is basically a list of tuple identifiers in the least compact
>form imaginable: the tuple is identified by the entire tuple contents.
>That doesn't seem very appealing, although I expect that it would
>still win for some queries.
>
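For concreteness, the per-chunk scheme discussed above might look roughly like the following Python sketch. This is purely illustrative and not PostgreSQL code: in-memory lists and sets stand in for the on-disk batch and unmatched-tuples files, outer tuples are assumed distinct so they can be identified by their full contents, and all names are made up.

```python
# Illustrative sketch of the chunked-probe idea from the quoted mail:
# the inner batch is split into chunks that fit in memory; every outer
# tuple is probed against every chunk (a tuple matched by an earlier
# chunk may match later chunks too), while a separate "unmatched" set
# tracks outer tuples that have never matched any chunk.

def probe_in_chunks(inner_chunks, outer_batch, emit):
    # Plays the role of the on-disk unmatched-outer-tuples file;
    # tuples are identified only by their entire contents.
    unmatched = set(outer_batch)
    for chunk in inner_chunks:
        # Build phase for this chunk: hash inner rows by join key.
        table = {}
        for key, payload in chunk:
            table.setdefault(key, []).append(payload)
        new_unmatched = set()
        # Probe with every original outer tuple, not just the
        # still-unmatched ones, so all join matches are emitted.
        for outer in outer_batch:
            key = outer[0]
            if key in table:
                for payload in table[key]:
                    emit((outer, (key, payload)))
            elif outer in unmatched:
                # Still unmatched after this chunk: carry it forward
                # into the next unmatched "file".
                new_unmatched.add(outer)
        unmatched = new_unmatched
    # Outer tuples that matched no chunk at all (what a LEFT/FULL
    # outer join would null-extend).
    return unmatched
```

Note how the unmatched "file" identifies a tuple only by its full contents, which is exactly the memcmp-style correlation cost described in the quoted mail.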

I wonder how big of an issue that actually is in practice. This is
meant for significantly skewed data sets, which may easily cause OOM
at the moment (e.g. per the recent report, which restarted this
discussion). So if we still expect to use this only for those rare
cases, the extra cost might be acceptable.

But if we plan to use this more widely (say, allow hash joins even for
cases that we know won't fit into work_mem), then the extra cost would
be an issue. But even then it should be included in the cost estimate,
and the planner could switch to a merge join when appropriate.

Of course, maybe there are many data sets with enough skew to cause
explosive batch growth and consume a lot of memory, but not enough to
trigger OOM. Those cases may get slower, but I think that's OK. If
appropriate, the user can increase work_mem and get the "good" plan.

FWIW this is a challenge for all approaches discussed in this thread,
not just this particular one: we're restricting the resources available
to the query, so we're switching to something (likely) slower.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 


