Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Avoiding hash join batch explosions with extreme skew and weird stats
Msg-id: 20200623222405.5ggfkwdwkm2cqitq@development
In response to: Re: Avoiding hash join batch explosions with extreme skew and weird stats (Melanie Plageman <melanieplageman@gmail.com>)
Responses:
  Re: Avoiding hash join batch explosions with extreme skew and weird stats
  Re: Avoiding hash join batch explosions with extreme skew and weird stats
List: pgsql-hackers
On Mon, Jun 08, 2020 at 05:12:25PM -0700, Melanie Plageman wrote:
>On Wed, May 27, 2020 at 7:25 PM Melanie Plageman <melanieplageman@gmail.com>
>wrote:
>
>> I've attached a rebased patch which includes the "provisionally detach"
>> deadlock hazard fix approach
>>
>
>Alas, the "provisional detach" logic proved incorrect (see last point in
>the list of changes included in the patch at bottom).
>
>> Also, we kept the batch 0 spilling patch David Kimura authored [1]
>> separate so it could be discussed separately because we still had some
>> questions.
>>
>
>The serial batch 0 spilling is in the attached patch. Parallel batch 0
>spilling is still in a separate patch that David Kimura is working on.
>
>I've attached a rebased and updated patch with a few fixes:
>
>- semi-join fallback works now
>- serial batch 0 spilling in main patch
>- added instrumentation for stripes to the parallel case
>- SharedBits uses same SharedFileset as SharedTuplestore
>- reverted the optimization to allow workers to re-attach to a batch and
>  help out with stripes if they are sure they pose no deadlock risk
>
>For the last point, I discovered a pretty glaring problem with this
>optimization: I did not include the bitmap created by a worker while
>working on its first participating stripe in the final combined bitmap.
>I was only combining the last bitmap file each worker worked on.
>
>I had the workers make new bitmaps for each time that they attached to
>the batch and participated, because having them keep an open file
>tracking information for a batch they are no longer attached to, on the
>chance that they might return and work on that batch, was a
>synchronization nightmare. It was difficult to figure out when to close
>the file if they never returned, and hard to make sure that the combining
>worker is actually combining all the files from all participants who
>were ever active.
>
>I am sure I can hack around those, but I think we need a better solution
>overall.
>After reverting those changes, loading and probing of stripes
>after stripe 0 is serial. This is not only sub-optimal, it also means
>that all the synchronization variables and code complexity around
>coordinating work on fallback batches is practically wasted.
>So, workers have to be able to collaborate on stripes after the first
>stripe. This version of the patch has correct results and no deadlock
>hazard, however, it lacks parallelism on stripes after stripe 0.
>I am looking for ideas on how to address the deadlock hazard more
>efficiently.
>
>The next big TODOs are:
>- come up with a better solution to the potential tuple emitting/barrier
>  waiting deadlock issue
>- parallel batch 0 spilling complete
>

Hi Melanie,

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The changes
in 7897e3bb90 apparently touched some of the code. I assume you're
working on a patch addressing the remaining TODOs, right?

I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.

Now, a couple comments / questions about the code.

nodeHash.c
----------

1) MultiExecPrivateHash says this

    /*
     * Not subject to skew optimization, so either insert normally
     * or save to batch file if it belongs to another stripe
     */

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.

2) I find the fields hashloop_fallback rather confusing.
We have one in HashJoinTable (where it's an array of BufFile items) and
another one in ParallelHashJoinBatch (this time just a bool). I think
the HashJoinTable one should be renamed to hashloopBatchFile (similarly
to the other BufFile arrays). Although I'm not sure why we even need
this file, when we have innerBatchFile? BufFile(s) are not exactly free,
in fact it's one of the problems for hashjoins with many batches.

3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches

    childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | hashtable->curbatch;

and also about this comment

    /*
     * TODO: what to do about tuples that don't go to the child
     * batch or stay in the current batch? (this is why we are
     * counting tuples to child and curbatch with two diff
     * variables in case the tuples go to a batch that isn't the
     * child)
     */
    if (batchno == childbatch)
        childbatch_outgoing_tuples++;

I thought each old batch is split into two new ones, and the tuples
either stay in the current one, or are moved to the new one - which I
presume is the childbatch, although I haven't tried to decode that
formula. So where else could the tuple go, as the comment tried to
suggest?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services