Re: Performance improvement for joins where outer side is unique - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Performance improvement for joins where outer side is unique
Date
Msg-id 55DA1F9F.5040204@2ndquadrant.com
Whole thread Raw
In response to Re: Performance improvement for joins where outer side is unique  (Erik Rijkers <er@xs4all.nl>)
Responses Re: Performance improvement for joins where outer side is unique  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
Hi,

I did some initial performance evaluation of this patch today, and I see
a clear improvement on larger joins. The scenario I've chosen for the
experiments is a simple fact-dimension join, i.e. a small table
referenced by a large table. So effectively something like this:

CREATE TABLE dim  (id INT PRIMARY KEY, ...);
CREATE TABLE fact (dim_d INT REFERENCES dim(id), ...);

except that I haven't used the foreign key constraint. In all the
experiments I've used a fact table 10x the size of the dimension, but I
believe what really matters most is the size of the dimension (and how
the hash table fits into the L2/L3 cache).

The query tested is very simple:

     select count(1) from (
         select * from f join d on (f.fact_id = d.dim_id)
     ) foo;

The outer aggregation is intentional - the join produces many rows and
formatting them as string would completely eliminate any gains from the
patch (even with "\o /dev/null" or such).

The following numbers come current master, running on E5-2630 v3
(2.40GHz), 64GB of RAM and this configuration:

  checkpoint_timeout = 15min
  effective_cache_size = 48GB
  maintenance_work_mem = 1GB
  max_wal_size = 4GB
  min_wal_size = 1GB
  random_page_cost = 1.5
  shared_buffers = 4GB
  work_mem = 1GB

all the other values are set to default.

I did 10 runs for each combination of sizes - the numbers seem quite
consistent and repeatable. I also looked at the median values.


dim 100k rows, fact 1M rows
---------------------------

          master     patched
         -------     -------
    1    286.184     265.489
    2    284.827     264.961
    3    281.040     264.768
    4    280.926     267.720
    5    280.984     261.348
    6    280.878     261.463
    7    280.875     261.338
    8    281.042     261.265
    9    281.003     261.236
   10    280.939     261.185
         -------     -------
  med    280.994     261.406 (-7%)


dim 1M rows, fact 10M rows
--------------------------

          master     patched
        --------    --------
    1   4316.235    3690.373
    2   4399.539    3738.097
    3   4360.551    3655.602
    4   4359.763    3626.142
    5   4361.821    3621.941
    6   4359.205    3654.835
    7   4371.438    3631.212
    8   4361.857    3626.237
    9   4357.317    3676.651
   10   4359.561    3641.830
        --------    --------
  med   4360.157    3648.333 (-17%)


dim 10M rows, fact 100M rows
----------------------------

          master     patched
        --------    --------
    1  46246.467   39561.597
    2  45982.937   40083.352
    3  45818.118   39674.661
    4  45716.281   39616.585
    5  45651.117   40463.966
    6  45979.036   41395.390
    7  46045.400   41358.047
    8  45978.698   41253.946
    9  45801.343   41156.440
   10  45720.722   41374.688
       ---------   ---------
  med  45898.408   40810.203 (-10%)


So the gains seem quite solid - it's not something that would make the
query an order of magnitude faster, but it's well above the noise.

Of course, in practice the queries will be more complicated, making the
improvement less significant, but I don't think that's a reason not to
apply it.

Two minor comments on the patch:

1) the 'subquery' variable in specialjoin_is_unique_join is unused

2) in the explain output, there should probably be a space before the
    '(inner unique)' text, so

      Hash Join (inner unique) ...

    instead of

      Hash Join(inner unique)

but that's just nitpicking at this point. Otherwise the patch seems
quite solid to me.

regard

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PostgreSQL for VAX on NetBSD/OpenBSD
Next
From: xpNitin
Date:
Subject: AIX 7.1 compile and initdb error TRAP: FailedAssertion