Home > mailing lists

Re: Experimental evaluation of PostgreSQL's query optimizer - Mailing list pgsql-hackers

From	Pavel Stehule
Subject	Re: Experimental evaluation of PostgreSQL's query optimizer
Date	December 22, 2015 12:10:22
Msg-id	CAFj8pRBDaZOyAQRymgW9YXZyhrXMeQMK00ME2aL5FNH-dJ9fxQ@mail.gmail.com Whole thread Raw
In response to	Re: Experimental evaluation of PostgreSQL's query optimizer (Viktor Leis <leis@in.tum.de>)
List	pgsql-hackers

Tree view

2015-12-22 9:28 GMT+01:00 Viktor Leis <leis@in.tum.de>:

On 12/22/2015 02:40 AM, Craig Ringer wrote:
> On 21 December 2015 at 23:57, Viktor Leis <leis@in.tum.de <mailto:leis@in.tum.de>> wrote:
>
>
>
> Please have a look at Figure 6 (page 6) in
> http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
> joins without index scan (going from (a) to (b)) results in great
> improvements across the board. And even more importantly, it avoids
> most of the cases where queries took unreasonably long and timed
> out. Basically this amounts to the being able to run the query on
> PostgreSQL or not.
>
>
> For that data, yes. But you're ignoring other important cases. Small or even 1-element lookup tables can be one where a nestloop over a seqscan turns out to be by far the fastest way to do the job.
> This can really add up if it's deep in a subplan that's excuted repeatedly, or if it's part of queries that get run very frequently on a busy OLTP system.
Ok here's what I presume to be the extreme case: Joining a large table
with a 1-entry table.

create table r1 (a int not null);
create table r2 (b int not null);
insert into r1 select 1 from generate_series(1,1000000);
insert into r2 values (1);
analyze r1;
analyze r2;

set enable_mergejoin to off;
set enable_nestloop to on;
set enable_hashjoin to off;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

set enable_nestloop to off;
set enable_hashjoin to on;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
join. However, let me stress that this is really the extreme case:

- If the join has few matches (due to inserting a value different from
1 into r2), hash and nested loop join have pretty much the same
performance.

- If you add just one more row to r2, the hash join is faster by a
similar margin.

- Also if there is disk IO or network involved, I suspect that you
will see no performance differences.

There are many difficult tradeoffs in any query optimizer, but I do
not think picking nested loops where a hash join can be used is one of
those. To me this seems more like a self-inflicted wound.

this is oversimplification :( Probably correct in OLAP, but wrong in OLTP. The seq scan enforced by hash join can be problematic.

Regards

Pavel

--
Viktor Leis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

From: Pavel Stehule
Date: 22 December 2015, 11:59:55
Subject: Re: plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types

From: Dilip Kumar
Date: 22 December 2015, 12:14:18
Subject: Re: parallel joins, and better parallel explain

Re: Experimental evaluation of PostgreSQL's query optimizer - Mailing list pgsql-hackers

Previous

Next