Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers
From:            Tomas Vondra
Subject:         Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date:
Msg-id:          20190516225427.uvqthiocuibkgh2s@development
In response to:  Re: Avoiding hash join batch explosions with extreme skew and weird stats (Thomas Munro <thomas.munro@gmail.com>)
List:            pgsql-hackers
On Fri, May 17, 2019 at 10:21:56AM +1200, Thomas Munro wrote:
>On Fri, May 17, 2019 at 4:39 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> I think this is a step in the right direction, but as I said on the
>> other thread(s), I think we should not disable growth forever and
>> recheck once in a while. Otherwise we'll end up in a sad situation
>> with non-uniform data sets, as pointed out by Hubert Zhang in [1].
>> It's probably even truer with this less strict logic, using 95% as a
>> threshold (instead of 100%).
>>
>> I kinda like the idea of increasing the spaceAllowed value.
>> Essentially, if we decide adding batches would be pointless,
>> increasing the memory budget is the only thing we can do anyway.
>
>But that's not OK, we need to fix THAT.
>

I agree increasing the budget is not ideal, although at the moment it's
the only thing we can do. If we can improve that, great.

>> The problem however is that we only really look at a single bit - it
>> may be that doubling the batches would not help, but doing it twice
>> would actually reduce the memory usage. For example, assume there are
>> 2 distinct values in the batch, with hash values (in binary)
>
>Yes, that's a good point, and not a case that we should ignore. But
>if we had a decent fall-back strategy that respected work_mem, we
>wouldn't care so much if we get it wrong in a corner case. I'm
>arguing that we should use Grace partitioning as our primary
>partitioning strategy, but fall back to looping (or possibly
>sort-merging) for the current batch if Grace doesn't seem to be
>working. You'll always be able to find cases where if you'd just
>tried one more round, Grace would work, but that seems acceptable to
>me, because getting it wrong doesn't melt your computer, it just
>probably takes longer. Or maybe it doesn't. How much longer would it
>take to loop twice? Erm, twice as long, and each loop makes actual
>progress, unlike extra speculative Grace partition expansions which
>apply not just to the current batch but all batches, might not
>actually work, and you *have* to abandon at some point. The more I
>think about it, the more I think that a loop-based escape valve, though
>unpalatably quadratic, is probably OK because we're in a sink-or-swim
>situation at this point, and our budget is work_mem, not work_time.
>

True.

>I'm concerned that we're trying to find ways to treat the symptoms,
>allowing us to exceed work_mem but maybe not so much, instead of
>focusing on the fundamental problem, which is that we don't yet have
>an algorithm that is guaranteed to respect work_mem.
>

Yes, that's a good point.

>Admittedly I don't have a patch, just a bunch of handwaving. One
>reason I haven't attempted to write it is because although I know how
>to do the non-parallel version using a BufFile full of match bits in
>sync with the tuples for outer joins, I haven't figured out how to do
>it for parallel-aware hash join, because then each loop over the outer
>batch could see different tuples in each participant. You could use
>the match bit in the HashJoinTuple header, but then you'd have to
>write all the tuples out again, which is more IO than I want to do.
>I'll probably start another thread about that.
>

That pesky parallelism ;-)

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
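
To make the "only looking at a single bit" point above concrete, here is a
minimal standalone C sketch. It is not PostgreSQL source: the helper, the
constants, and the two hash values are invented for the illustration, though
the arithmetic mirrors the way ExecHashGetBucketAndBatch() takes the batch
number from the hash bits above the bucket bits. The two values agree in the
first extra bit that one doubling of nbatch would consume, but differ in the
next one, so doubling once leaves them in the same batch while doubling twice
separates them.

    /* hash_batch_demo.c -- illustrative only, not PostgreSQL source */
    #include <stdio.h>
    #include <stdint.h>

    /*
     * Batch number taken from the hash bits above the bucket bits,
     * in the spirit of ExecHashGetBucketAndBatch().
     */
    static int
    batchno(uint32_t hashvalue, int log2_nbuckets, int nbatch)
    {
        return (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
    }

    int
    main(void)
    {
        int      log2_nbuckets = 10;  /* 1024 buckets; batch bits start at bit 10 */
        uint32_t h1 = 0x00000400;     /* bit 10 set, bit 11 clear */
        uint32_t h2 = 0x00000C00;     /* bits 10 and 11 set */

        for (int nbatch = 1; nbatch <= 4; nbatch *= 2)
            printf("nbatch = %d: h1 -> batch %d, h2 -> batch %d\n",
                   nbatch, batchno(h1, log2_nbuckets, nbatch),
                   batchno(h2, log2_nbuckets, nbatch));

        /*
         * With nbatch = 2 both values still land in batch 1, so one doubling
         * does not help; with nbatch = 4 they split into batches 1 and 3.
         */
        return 0;
    }

Checking only the single next bit correctly concludes that one more doubling
is useless here, yet two doublings would in fact split the batch - exactly
the scenario the quoted paragraph describes.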
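
Similarly, here is a toy sketch of the "match bits kept in sync with the
outer tuples" idea for a loop-based fallback on an outer join. Again, this is
not PostgreSQL code and does not use the real BufFile API; it simply keeps
one bit per spilled outer tuple in a separate array (on disk this would be a
parallel file), so repeated passes over the outer batch can record matches
without rewriting the tuples, and unmatched outer tuples are null-extended
only after the last pass. The keys and chunk sizes are made up.

    /* match_bits_demo.c -- illustrative only, not PostgreSQL source */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NOUTER 8   /* number of tuples in the (spilled) outer batch */

    int
    main(void)
    {
        uint8_t matchbits[(NOUTER + 7) / 8];
        int     outer_keys[NOUTER] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* pretend the inner batch only fits in memory as two chunks */
        int     inner_chunks[2][2] = {{2, 4}, {6, 8}};

        memset(matchbits, 0, sizeof(matchbits));

        /* one pass over the outer batch per inner chunk */
        for (int chunk = 0; chunk < 2; chunk++)
            for (int i = 0; i < NOUTER; i++)
                for (int j = 0; j < 2; j++)
                    if (outer_keys[i] == inner_chunks[chunk][j])
                    {
                        printf("joined outer key %d\n", outer_keys[i]);
                        matchbits[i / 8] |= 1 << (i % 8);  /* remember the match */
                    }

        /* after the last pass, emit null-extended rows for unmatched outers */
        for (int i = 0; i < NOUTER; i++)
            if (!(matchbits[i / 8] & (1 << (i % 8))))
                printf("outer key %d has no match (emit NULLs)\n", outer_keys[i]);

        return 0;
    }

The point of the separate bit array is the one made in the quoted mail: the
matched/unmatched state must survive multiple passes over the same outer
batch without rewriting every tuple, which is where the parallel-aware case
gets hard because each participant may see different tuples on each pass.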