Re: PATCH: postpone building buckets to the end of Hash (in HashJoin) - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)
Date:
Msg-id: 569DB42D.4030302@2ndquadrant.com
In response to: Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: PATCH: postpone building buckets to the end of Hash (in HashJoin)  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 12/17/2015 10:28 PM, Tomas Vondra wrote:
> Hi,
>
> On 12/17/2015 07:20 PM, Robert Haas wrote:
> ...
>>
>> If this doesn't regress performance in the case where the number of
>> buckets is estimated accurately to begin with, then I think this is
>> a great idea. Can you supply some performance tests results for that
>> case, and maybe some of the other cases also?
>
> I don't see how it could regress performance, and the benchmarks I've
> done confirm that. I'll do more thorough benchmarking and post the
> results here, but not now as this patch is in 2016-01 CF and I want to
> put all my time into reviewing patches from the open commitfest.

I've finally gotten around to doing more thorough benchmarks, and
surprisingly there seems to be a minor regression. The scripts I've used
are attached, along with the results; in short, the benchmark joins two
tables in different size combinations:

   1M x 10M
   10M x 100M
   5M x 250M

with different work_mem sizes (so that the smallest value uses a bunch
of batches while the largest value uses a single batch).

The 1x10 and 10x100 data sets are designed to fit into RAM on the machine
(including the temporary files in batching mode), while 5x250 is designed
to specifically force the temp files to disk (but it's on a fast SSD
array, so not a big deal).

Average timings for current master and the first two patches of the
series (0001 postpones the build of buckets and 0002 always starts
without batching) look like this (the timings are in milliseconds):

     size   work_mem    master    postpone   no-batch
  -----------------------------------------------------
     1x10          8      5735      5760         6107
                  32      5939      5955         6124
                 128      4852      5038         5175
  -----------------------------------------------------
    5x250         64    158512    158429       159008
                 256    144046    144584       144994
                1024    111478    117529       116933
  -----------------------------------------------------
   10x100         64     68389     68278        68530
                 256     66693     66415        66605
                1024     48606     50654        50490

So compared to master, the relative differences look like this (positive
means slower than master):

     size   work_mem    postpone   no-batch
   -----------------------------------------
     1x10          8       0.44%      6.50%
                  32       0.28%      3.13%
                 128       3.83%      6.65%
   -----------------------------------------
    5x250         64      -0.05%      0.31%
                 256       0.37%      0.66%
                1024       5.43%      4.89%
   -----------------------------------------
   10x100         64      -0.16%      0.21%
                 256      -0.42%     -0.13%
                1024       4.21%      3.88%

So for the first patch (postponing the building of buckets), the
difference with batching is rather negligible, possibly within noise
(~0.5%). When the hash join runs with a single batch, however, the
difference gets much more significant, around 4-5%.

I'm not going to discuss the second patch for now, but naturally it has
the same issue and apparently also some additional ones (at least on the
1x10 data set).

I have spent a fair amount of time profiling this, but I can't wrap my
head around where the difference comes from. Essentially we don't do any
additional work, we've just moved some of the existing work to a
different place in the code.

It might change the memory access pattern a bit, but the original
pattern is not exactly sequential either, as what we're moving around are
random updates to an array of pointers. Or rather, we're moving them into
the BuildBuckets() function, so if something consumes more time, it
should show up in that function, I believe.
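
To make that concrete, here's a toy sketch of the two strategies (this is
not the actual code from master or from the patch; the types and names
below are made up purely for illustration):

    /* Toy illustration only -- simplified types, not the real hash table
     * or tuple structures. */
    #include <stdlib.h>

    typedef struct Tuple
    {
        unsigned int  hashvalue;
        struct Tuple *next;          /* chain of tuples within one bucket */
    } Tuple;

    /* master-like: link each tuple into its bucket as soon as it's loaded */
    void
    insert_immediately(Tuple **buckets, unsigned int nbuckets, Tuple *tuple)
    {
        unsigned int bucketno = tuple->hashvalue % nbuckets;

        tuple->next = buckets[bucketno];    /* random write into buckets[] */
        buckets[bucketno] = tuple;
    }

    /* patch-like: tuples are only accumulated into 'tuples' while loading,
     * and all bucket chains are built in a single pass at the very end */
    void
    build_buckets(Tuple **buckets, unsigned int nbuckets,
                  Tuple **tuples, size_t ntuples)
    {
        for (size_t i = 0; i < ntuples; i++)
        {
            unsigned int bucketno = tuples[i]->hashvalue % nbuckets;

            tuples[i]->next = buckets[bucketno];   /* same random writes, later */
            buckets[bucketno] = tuples[i];
        }
    }

The writes into the bucket array are equally random in both cases; the
patch only changes when they happen.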

So I've measured how much time the function takes for the 1x10 and
10x100 datasets, and it's ~25ms and ~290ms respectively. That's way less
than the actual difference in query runtime, which is ~190ms and ~2050ms.

So even if we entirely skipped the bucket build, we would not recover
the difference. Not sure what's going on here.
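
(For reference, this kind of one-off timing can be collected by simply
wrapping the call with the instr_time macros; the following is a rough
sketch of that approach, not necessarily the exact instrumentation used
here.)

    /*
     * Rough sketch: timing a single call with the instr_time macros from
     * src/include/portability/instr_time.h. BuildBuckets() stands in for
     * whatever function is being measured; the name is only indicative.
     */
    #include "postgres.h"
    #include "executor/hashjoin.h"
    #include "portability/instr_time.h"

    extern void BuildBuckets(HashJoinTable hashtable);  /* stand-in */

    static void
    TimedBuildBuckets(HashJoinTable hashtable)
    {
        instr_time  start;
        instr_time  duration;

        INSTR_TIME_SET_CURRENT(start);

        BuildBuckets(hashtable);            /* the call being measured */

        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);

        elog(LOG, "BuildBuckets: %.3f ms", INSTR_TIME_GET_MILLISEC(duration));
    }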

I've also done some profiling using perf, but I haven't really found
anything suspicious there.

Any ideas what might be the cause of this?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

