Thread: [PERFORM] Optimization inner join

[PERFORM] Optimization inner join

From
Clailson
Date:
Hi,

Is there something in the roadmap to optimize the inner join?

I have the situation below. Table b has 400 rows with NULL in column b.

explain analyze select * from a inner join b on (b.b = a.a);
"Merge Join  (cost=0.55..65.30 rows=599 width=16) (actual time=0.030..1.173 rows=599 loops=1)" 
"  Merge Cond: (a.a = b.b)" 
"  ->  Index Scan using a_pkey on a  (cost=0.28..35.27 rows=1000 width=8) (actual time=0.014..0.364 rows=1000 loops=1)" 
"  ->  Index Scan using in01 on b  (cost=0.28..33.27 rows=1000 width=8) (actual time=0.012..0.249 rows=600 loops=1)" 
"Total runtime: 1.248 ms" 

My question is: why isn't the planner removing the NULL rows during the scan of table b?
-- 
Clailson Soares Dinízio de Almeida

Re: [PERFORM] Optimization inner join

From
"Phillip Couto"
Date:
NULL is still a value that may be paired with a NULL in a.a

The only optimization I could see is if the a.a column has NOT NULL defined while b.b does not have NOT NULL defined.

Not sure if it is all that common. Curious what happens if you put b.b IS NOT NULL in the WHERE statement?
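[For what it's worth, the result set is the same either way, since an inner join on b.b = a.a can never match a NULL key; only the plan can change. A quick sketch, with SQLite standing in for PostgreSQL and sample data invented to match the thread's row counts:]

```python
import sqlite3

# Hypothetical data modeled on the thread: 1000 rows in a; in b, 600 rows
# carry a matching key and 400 rows have NULL in b.b.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (a INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE b (a INTEGER PRIMARY KEY, b INTEGER REFERENCES a(a))")
con.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1, 1001)])
con.executemany("INSERT INTO b VALUES (?, ?)",
                [(i, i if i <= 600 else None) for i in range(1, 1001)])

plain = con.execute(
    "SELECT count(*) FROM a INNER JOIN b ON b.b = a.a").fetchone()[0]
filtered = con.execute(
    "SELECT count(*) FROM a INNER JOIN b ON b.b = a.a "
    "WHERE b.b IS NOT NULL").fetchone()[0]
print(plain, filtered)  # 600 600 -- the inner join already drops NULL keys
```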

-----------------
Phillip Couto




Re: [PERFORM] Optimization inner join

From
Clailson
Date:
Hi Phillip.

> The only optimization I could see is if the a.a column has NOT NULL defined while b.b does not have NOT NULL defined.
a.a is the primary key on table a and b.b is the foreign key on table b.

        Table "public.a"
+--------+---------+-----------+
| Column |  Type   | Modifiers |
+--------+---------+-----------+
| a      | integer | not null  |
| b      | integer |           |
+--------+---------+-----------+
Indexes:    "a_pkey" PRIMARY KEY, btree (a)
Referenced by:    TABLE "b" CONSTRAINT "b_b_fkey" FOREIGN KEY (b) REFERENCES a(a)

        Table "public.b"
+--------+---------+-----------+
| Column |  Type   | Modifiers |
+--------+---------+-----------+
| a      | integer | not null  |
| b      | integer |           |
+--------+---------+-----------+
Indexes:    "b_pkey" PRIMARY KEY, btree (a)
Foreign-key constraints:    "b_b_fkey" FOREIGN KEY (b) REFERENCES a(a)

> Not sure if it is all that common. Curious what if you put b.b IS NOT NULL in the WHERE statement?

That's the question. At the company I work for, one of my clients asked me: "Why doesn't PostgreSQL remove the rows with NULL in column b (table b) before joining, since those rows have no match in table a?" I suggested putting IS NOT NULL in the WHERE clause, but he can't modify the query in the application.

I tested with Oracle, and it adds a predicate to the query plan that removes the rows where b.b is null. In Oracle it's the same plan, with and without IS NOT NULL in the WHERE clause.
-- 
Clailson Soares Dinízio de Almeida



Re: [PERFORM] Optimization inner join

From
"Phillip Couto"
Date:
Ah, ok, that makes sense. I am curious whether there is actually a performance benefit to doing that. In PostgreSQL, per the execution plan you provided, the Merge Join combines both sets after they have been sorted. If they are already sorted, the NULLs will all be grouped at the beginning or the end (I can't remember which order is used), and PostgreSQL will just skip all those records with probably the same effort as removing them before merging. The only potential improvement I can see is when one set contains a lot of NULLs: the cost of sorting them might be large enough to recoup by discarding them before the sort.

I hope someone more familiar with the internals can chime in as I would like to learn more if there is a real benefit here or a better idea why postgres does not do it.
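[The intuition above can be made concrete with a toy sort-merge join; this is an illustration with invented names, not PostgreSQL's actual executor. NULL keys can be dropped before the sort because under SQL semantics they can never produce a match:]

```python
def merge_join(left, right):
    """Toy sort-merge join over two lists of scalar keys.

    None stands in for NULL and is filtered out before sorting, since a
    NULL key can never compare equal to anything in an inner join.
    """
    ls = sorted(k for k in left if k is not None)
    rs = sorted(k for k in right if k is not None)
    out, i, j = [], 0, 0
    while i < len(ls) and j < len(rs):
        if ls[i] < rs[j]:
            i += 1
        elif ls[i] > rs[j]:
            j += 1
        else:
            # Equal keys: emit the cross product of the two equal runs.
            key, li, rj = ls[i], i, j
            while i < len(ls) and ls[i] == key:
                i += 1
            while j < len(rs) and rs[j] == key:
                j += 1
            out.extend([(key, key)] * ((i - li) * (j - rj)))
    return out

print(merge_join([1, 2, 3, None], [2, 3, 3, None, None]))
# [(2, 2), (3, 3), (3, 3)]
```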

-----------------
Phillip Couto






Re: [PERFORM] Optimization inner join

From
Gustavo Rezende Montesino
Date:
Hello,

On 19/01/2017 11:04, Clailson wrote:
> Hi Phillip.
>
> > Not sure if it is all that common. Curious what if you put b.b IS NOT NULL in the WHERE statement?
>
> It's the question. In the company I work with, one of my clients asked me: "Why PostgreSQL does not remove rows with null in column b (table b), before joining, since these rows have no corresponding in table a?" I gave the suggestion to put the IS NOT NULL in the WHERE statement, but he can't modify the query in the application.
>
> I did the tests with Oracle and it uses a predicate in the query plan, removing the lines where b.b is null. In Oracle, it's the same plan, with and without IS NOT NULL in the WHERE statement.

Being the client in question, I would like to make a little remark: what we thought could be optimized here, at first, is the row estimate of the index scan, which could take null_frac into account. To put things into perspective, our similar case in production has a table with 6 million rows where only 9.5k are non-null in the join field, and the over-estimation is throwing away good plans (~150 ms execution time) in favor of pretty bad ones (~80 s execution time).

We've asked the application people to add the IS NOT NULL workaround, which works great, and are waiting on an answer, but I believe getting better estimates without it would be great if possible.
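[For context, the planner's join-level estimate does fold null_frac in. A rough sketch of the default path of PostgreSQL's eqjoinsel() (ignoring the MCV lists the real function also consults), with illustrative figures loosely modeled on the production case; the 9.5k-row, all-distinct parent side is an assumption:]

```python
def eq_join_rows(n1, ndistinct1, nullfrac1, n2, ndistinct2, nullfrac2):
    """Simplified default (no-MCV) path of PostgreSQL's eqjoinsel():
    selectivity = (1 - nullfrac1) * (1 - nullfrac2) / max(nd1, nd2),
    and the join size estimate is n1 * n2 * selectivity."""
    sel = (1.0 - nullfrac1) * (1.0 - nullfrac2) / max(ndistinct1, ndistinct2)
    return n1 * n2 * sel

# Shape of the production case: 6M-row child with only 9.5k non-NULL join
# keys, joined to an assumed 9.5k-row parent with distinct, non-NULL keys.
est = eq_join_rows(6_000_000, 9_500, 1 - 9_500 / 6_000_000, 9_500, 9_500, 0.0)
print(round(est))  # about 9500: the join-size estimate does see the NULLs
```

The complaint above is about the scan-level row estimate feeding the cost model, which this formula does not address.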


On 19/01/2017 09:34, Phillip Couto wrote:
> NULL is still a value that may be paired with a NULL in a.a

Is that so? I would believe you would never get a match, as NULL <> NULL.





Regards,

Gustavo R. Montesino
Tribunal Regional do Trabalho da 2a Região
Secretaria de Tecnologia da Informação e Comunicação
Coordenadoria de Infraestrutura de TIC
Seção de Administração de Banco de Dados
Av. Marquês de São Vicente, 121 - Bl. A - Sala 404
Telefone: (11) 3150-2082

Re: [PERFORM] Optimization inner join

From
Vitalii Tymchyshyn
Date:

Hi.

In SQL, "NULL = any value" evaluates to unknown, never true, so the optimizer can safely skip NULLs from either side of an inner join.
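[This NULL-comparison rule is easy to check directly; for example, in a throwaway SQLite session standing in for any SQL engine:]

```python
import sqlite3

con = sqlite3.connect(":memory:")
# '=' against NULL yields unknown (NULL), never true; only IS NULL is true.
row = con.execute("SELECT NULL = 1, NULL = NULL, NULL IS NULL").fetchone()
print(row)  # (None, None, 1)
```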

Best regards, Vitalii Tymchyshyn





Re: [PERFORM] Optimization inner join

From
"Phillip Couto"
Date:
I apologize: my statement about NULLs being matched in a join was incorrect, as both Vitalii and Gustavo pointed out in their respective replies.

-----------------
Phillip Couto




Re: [PERFORM] Optimization inner join

From
Tom Lane
Date:
Gustavo Rezende Montesino <gustavo.montesino@trtsp.jus.br> writes:
> Being the client in question, I would like to make a little remark: What
> we thought could be optimized here at first is on the row estimate of
> the index scan; which could take null_frac into account. To put things
> into perspective, our similar case in production has a table with 6
> million lines where only 9.5k aren´t null for the join field, an the
> over-estimation is throwing away good plans (like ~150ms execution time)
> in favor of pretty bad ones (~80s execution time).

Please provide a concrete test case for that.  AFAIK the null fraction
should be accounted for in join size estimates.  Here's a little test
case showing that it is:

regression=# create table t1 as select generate_series(1,1000000) as f1;
SELECT 1000000
regression=# analyze t1;
ANALYZE
regression=# create table t2 as select generate_series(1,1000000) as f1;
SELECT 1000000
regression=# analyze t2;
ANALYZE
regression=# explain select * from t1,t2 where t1.f1=t2.f1;
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=30832.00..70728.00 rows=1000000 width=8)
   Hash Cond: (t1.f1 = t2.f1)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=4)
(5 rows)

regression=# insert into t2 select null from generate_series(1,1000000);
INSERT 0 1000000
regression=# analyze t2;
ANALYZE
regression=# explain select * from t1,t2 where t1.f1=t2.f1;
                               QUERY PLAN
------------------------------------------------------------------------
 Hash Join  (cost=30832.00..95727.00 rows=1000000 width=8)
   Hash Cond: (t2.f1 = t1.f1)
   ->  Seq Scan on t2  (cost=0.00..27862.00 rows=2000000 width=4)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=4)
         ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4)
(5 rows)

The join size estimate is still correct even though it knows there are
many more rows in t2.

As for inserting a not-null test at the scan level, I'm not exactly
convinced that it's a win:

regression=# \timing
Timing is on.
regression=# select count(*) from t1,t2 where t1.f1=t2.f1;
  count
---------
 1000000
(1 row)

Time: 562.914 ms
regression=# select count(*) from t1,t2 where t1.f1=t2.f1 and t2.f1 is not null;
  count
---------
 1000000
(1 row)

Time: 564.896 ms

[ ftr, these times are best-of-three-trials ]

It's possible that in the case where an explicit sort has to be inserted,
reducing the amount of data passing through the sort would be worth doing;
but in the general case that's unproven.

            regards, tom lane


Re: [PERFORM] Optimization inner join

From
"Phillip Couto"
Date:
The picture is becoming clearer now. So, to recap: the issue is that plan selection does not properly utilize the null_frac statistic to skip what seems to be, in your case, the 99% of rows that are NULL in the join field and would be discarded anyway.

For completeness do you mind posting what versions of postgres you have tested this on?

-----------------
Phillip Couto






Re: [PERFORM] Optimization inner join

From
Gustavo Rezende Montesino
Date:
On 19/01/2017 12:13, Tom Lane wrote:
> Gustavo Rezende Montesino <gustavo.montesino@trtsp.jus.br> writes:
>> Being the client in question, I would like to make a little remark: What
>> we thought could be optimized here at first is on the row estimate of
>> the index scan; which could take null_frac into account. To put things
>> into perspective, our similar case in production has a table with 6
>> million lines where only 9.5k aren´t null for the join field, an the
>> over-estimation is throwing away good plans (like ~150ms execution time)
>> in favor of pretty bad ones (~80s execution time).
> Please provide a concrete test case for that.  AFAIK the null fraction
> should be accounted for in join size estimates.  Here's a little test
> case showing that it is:

Hello,

Expanding a little on your example:

postgres=# create table t1 as select generate_series(1,1000000) as f1;
SELECT 1000000
postgres=# create table t2 as select generate_series(1,1000000) as f1;
SELECT 1000000
postgres=# insert into t2 select null from generate_series(1,1000000);
INSERT 0 1000000
postgres=# create index on t1(f1);
CREATE INDEX
postgres=# create index on t2(f1);
CREATE INDEX
postgres=# analyze t1;
ANALYZE
postgres=# analyze t2;
ANALYZE
postgres=# explain select * from t1,t2 where t1.f1=t2.f1;
                                     QUERY PLAN
-----------------------------------------------------------------------------------------
 Merge Join  (cost=2.68..59298.81 rows=499433 width=8)
   Merge Cond: (t1.f1 = t2.f1)
   ->  Index Only Scan using t1_f1_idx on t1  (cost=0.42..24916.42 rows=1000000 width=4)
   ->  Index Only Scan using t2_f1_idx on t2  (cost=0.43..48837.43 rows=2000000 width=4)
(4 rows)
postgres=# explain select * from t1,t2 where t1.f1=t2.f1 and t2.f1 is not null;
                                     QUERY PLAN
-----------------------------------------------------------------------------------------
 Merge Join  (cost=1.85..44588.02 rows=249434 width=8)
   Merge Cond: (t1.f1 = t2.f1)
   ->  Index Only Scan using t1_f1_idx on t1  (cost=0.42..24916.42 rows=1000000 width=4)
   ->  Index Only Scan using t2_f1_idx on t2  (cost=0.43..26890.60 rows=998867 width=4)
         Index Cond: (f1 IS NOT NULL)
(5 rows)


Notice the difference in the estimated costs. In our real case this difference leads to a (very) bad plan choice.

BTW, execution itself is indeed faster without the NOT NULL clause.

These tests were on 9.3, but our production system with the "real" case is on 9.6. The behavior seems to be the same on both.


Regards,

Gustavo R. Montesino