Re: Avoiding hash join batch explosions with extreme skew and weird stats - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Avoiding hash join batch explosions with extreme skew and weird stats
Date
Msg-id 20190910131057.r7hhmynayz7k5r33@development
In response to Re: Avoiding hash join batch explosions with extreme skew and weird stats  (Melanie Plageman <melanieplageman@gmail.com>)
List pgsql-hackers
On Fri, Sep 06, 2019 at 10:54:13AM -0700, Melanie Plageman wrote:
>On Thu, Sep 5, 2019 at 10:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
>> Seems like a good time for me to try to summarise what I think the
>> main problems are here:
>>
>> 1.  The match-bit storage problem already discussed.  The tuples that
>> each process receives while reading from SharedTupleStore are
>> non-deterministic (like other parallel scans).  To use a bitmap-based
>> approach, I guess we'd need to invent some way to give the tuples a
>> stable identifier within some kind of densely packed number space that
>> we could use to address the bitmap, or take the IO hit and write all
>> the tuples back.  That might involve changing the way SharedTupleStore
>> holds data.
>>
>
>This I've dealt with by adding a tuplenum to the SharedTupleStore
>itself which I atomically increment in sts_puttuple().
>In ExecParallelHashJoinPartitionOuter(), as each worker writes tuples
>to the batch files, they call sts_puttuple() and this increments the
>number so each tuple has a unique number.
>For persisting this number, I added the tuplenum to the meta data
>section of the MinimalTuple (along with the hashvalue -- there was a
>comment about this meta data that said it could be used for other
>things in the future, so this seemed like a good place to put it) and
>write that out to the batch file.
>
>At the end of ExecParallelHashJoinPartitionOuter(), I make the outer
>match status bitmap file. I use the final tuplenum count to determine
>the number of bytes to write to it. Each worker has a file with a
>bitmap which has the number of bytes required to represent the number
>of tuples in that batch.
>
>Because one worker may beat the other(s) and build the whole batch
>file for a batch before the others have a chance, I also make the
>outer match status bitmap file for workers who missed out in
>ExecParallelHashJoinOuterGetTuple() using the final tuplenum as well.
>

That seems like a perfectly sensible solution to me. I'm sure there are
ways to optimize it (say, having a bitmap optimized for sparse data, or
a bitmap shared by all the workers, or something like that), but that's
definitely not needed for v1.

Even having a bitmap per worker is pretty cheap. Assume we have 1B rows;
the bitmap is 1B/8 bytes = ~125MB per worker, so with 16 workers that's
~2GB, give or take. But at, say, 100B per row the original data is ~100GB,
so the bitmaps are not free, but they're not terrible either.
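
For the record, the arithmetic I have in mind is roughly the following toy
example (just an illustration of the per-worker bitmap cost and of addressing
the bitmap by tuplenum, not code from your patch):

    #include <stdint.h>
    #include <stdio.h>

    /* one match bit per outer tuple, addressed by the dense tuplenum */
    static uint64_t bitmap_bytes(uint64_t ntuples)
    {
        return (ntuples + 7) / 8;
    }

    static void set_match_bit(uint8_t *bitmap, uint64_t tuplenum)
    {
        bitmap[tuplenum / 8] |= (uint8_t) (1 << (tuplenum % 8));
    }

    int main(void)
    {
        uint64_t ntuples = UINT64_C(1000000000);    /* 1B outer tuples */
        uint8_t  page[16] = {0};

        set_match_bit(page, 42);    /* tuplenum 42 -> byte 5, bit 2 */

        printf("per-worker bitmap: ~%.0f MB\n", bitmap_bytes(ntuples) / 1e6);
        printf("16 workers:        ~%.1f GB\n",
               16 * bitmap_bytes(ntuples) / 1e9);
        return 0;
    }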

>>
>> 2.  Tricky problems relating to barriers and flow control.  First, let
>> me explain why PHJ doesn't support full/right outer joins yet.  At
>> first I thought it was going to be easy, because, although the shared
>> memory hash table is read-only after it has been built, it seems safe
>> to weaken that only slightly and let the match flag be set by any
>> process during probing: it's OK if two processes clobber each other's
>> writes, as the only transition is a single bit going strictly from 0
>> to 1, and there will certainly be a full memory barrier before anyone
>> tries to read those match bits.  Then during the scan for unmatched,
>> you just have to somehow dole out hash table buckets or ranges of
>> buckets to processes on a first-come-first-served basis.  But.... then
>> I crashed into the following problem:
>>
>> * You can't begin the scan for unmatched tuples until every process
>> has finished probing (ie until you have the final set of match bits).
>> * You can't wait for every process to finish probing, because any
>> process that has emitted a tuple might never come back if there is
>> another node that is also waiting for all processes (ie deadlock
>> against another PHJ doing the same thing), and probing is a phase that
>> emits tuples.
>>
>> Generally, it's not safe to emit tuples while you are attached to a
>> Barrier, unless you're only going to detach from it, not wait at it,
>> because emitting tuples lets the program counter escape your control.
>> Generally, it's not safe to detach from a Barrier while accessing
>> resources whose lifetime it controls, such as a hash table, because
>> then it might go away underneath you.
>>
>> The PHJ plans that are supported currently adhere to that programming
>> rule and so don't have a problem: after the Barrier reaches the
>> probing phase, processes never wait for each other again so they're
>> free to begin emitting tuples.  They just detach when they're done
>> probing, and the last to detach cleans up (frees the hash table etc).
>> If there is more than one batch, they detach from one batch and attach
>> to another when they're ready (each batch has its own Barrier), so we
>> can consider the batches to be entirely independent.
>>
>> There is probably a way to make a scan-for-unmatched-inner phase work,
>> possibly involving another Barrier or something like that, but I ran
>> out of time trying to figure it out and wanted to ship a working PHJ
>> for the more common plan types.  I suppose PHLJ will face two variants
>> of this problem: (1) you need to synchronise the loops (you can't dump
>> the hash table in preparation for the next loop until all have
>> finished probing for the current loop), and yet you've already emitted
>> tuples, so you're not allowed to wait for other processes and they're
>> not allowed to wait for you, and (2) you can't start the
>> scan-for-unmatched-outer until all the probe loops belonging to one
>> batch are done.  The first problem is sort of analogous to a problem I
>> faced with batches in the first place, which Robert and I found a
>> solution to by processing the batches in parallel, and could perhaps
>> be solved in the same way: run the loops in parallel (if that sounds
>> crazy, recall that every worker has its own quota of work_mem and the
>> data is entirely prepartitioned up front, which is why we are able to
>> run the batches in parallel; in constrast, single-batch mode makes a
>> hash table with a quota of nparticipants * work_mem).  The second
>> problem is sort of analogous to the existing scan-for-unmatched-inner
>> problem that I haven't solved.
>>
>>
>I "solved" these problems for now by having all workers except for one
>detach from the outer batch file after finishing probing. The last
>worker to arrive does not detach from the batch and instead iterates
>through all of the workers' outer match status files (per-participant
>shared mem SharedTuplestoreParticipant) and creates a single unified
>bitmap. All the other workers continue to wait at the barrier until
>the sole remaining worker has finished iterating through the
>outer match status bitmap files.
>

Why did you put solved in quotation marks? This seems like a reasonable
solution to me, at least for now, but the quotation marks kinda suggest
you think it's either not correct or not good enough. Or did I miss some
flaw that makes this unacceptable?
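
FWIW, the way I understand the coordination, it boils down to roughly the
following (purely illustrative -- the barrier field, the wait event and the
merge functions are all made up, only BarrierArriveAndWait() is the existing
API):

    /*
     * BarrierArriveAndWait() returns true in exactly one of the attached
     * participants, so that worker can be elected to build the unified
     * bitmap while everyone else waits for the next phase.
     */
    if (BarrierArriveAndWait(&batch->outer_match_barrier, 0 /* wait event */))
    {
        /* elected worker: OR the per-worker match-bit files together */
        for (int i = 0; i < nparticipants; i++)
            merge_outer_match_bits(unified_bitmap, worker_bitmap_file(i));
    }

    /* wait again, so nobody reads the unified bitmap before it's complete */
    BarrierArriveAndWait(&batch->outer_match_barrier, 0 /* wait event */);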

>Admittedly, I'm still fighting with this step a bit, but, my intent is
>to have all the backends wait until the lone remaining worker has
>created the unified bitmap, then, that worker, which is still attached
>to the outer batch will scan the outer batch file and the unified
>outer match status bitmap and emit unmatched tuples.
>

Makes sense, I think.

The one "issue" this probably has is that it serializes the last step,
i.e. the search for unmatched tuples is done in a single process instead
of being parallelized over multiple workers. That's certainly unfortunate,
but is that really an issue in practice? Probably not for queries with
just a small number of unmatched tuples. And for cases with many unmatched
rows it's probably going to degrade to the non-parallel case.
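
To be concrete about what gets serialized, I imagine that final step looking
roughly like this (again just a sketch -- sts_parallel_scan_next() is the
existing SharedTupleStore API, but the metadata layout with the tuplenum and
the surrounding names are my assumptions about your patch):

    MinimalTuple   tuple;
    OuterTupleMeta meta;    /* assumed to carry the hashvalue + tuplenum */

    while ((tuple = sts_parallel_scan_next(outer_acc, &meta)) != NULL)
    {
        uint64 tuplenum = meta.tuplenum;

        if ((unified_bitmap[tuplenum / 8] & (1 << (tuplenum % 8))) == 0)
        {
            /*
             * No worker recorded a match for this tuple, so emit it with
             * the inner side null-extended.
             */
        }
    }

That's a single sequential pass over the outer batch file plus a bit test
per tuple, so even if it's not parallel, the per-tuple cost should be low.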

>I thought that the other workers could move on and stop waiting at the
>barrier once the lone remaining worker has scanned their outer match
>status files. All the probe loops would be done, and the worker that
>is emitting tuples is not referencing the inner side hashtable at all,
>only the outer batch file and the combined bitmap.
>

Why would the workers need to wait for the lone worker to scan their
bitmap file? Or do the files disappear with the workers, or something
like that? 

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



