Re: [PERFORM] anti-join with small table via text/varchar cannot estimate rows correctly - Mailing list pgsql-performance

From Stefan Andreatta
Subject Re: [PERFORM] anti-join with small table via text/varchar cannot estimate rows correctly
Msg-id 36ec1d2e-fe3b-ec14-2f43-558435352b8c@synedra.com
In response to Re: [PERFORM] anti-join with small table via text/varchar cannot estimate rows correctly  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On 02.03.2017 02:06, Tom Lane wrote:
> Stefan Andreatta <s.andreatta@synedra.com> writes:
>> The same anti-join using the text fields, however estimates just 1
>> resulting row, while there are still of course 9,999 of them:
>> =# explain analyze
>>        select tmp_san_1.id
>>        from tmp_san_1
>>          left join tmp_san_2 on tmp_san_1.text = tmp_san_2.text
>>        where tmp_san_2.id is null;
> That is not an anti-join.  To make it one, you have to constrain the RHS
> join column to be IS NULL, not some random other column.  Note the join
> type isn't getting shown as Anti:
>
>>    Hash Left Join  (cost=1.02..192.53 rows=1 width=4) (actual time=0.020..3.091 rows=9999 loops=1)
> As written, the query could return some rows that weren't actually
> antijoin rows, ie tmp_san_1.text *did* have a match in tmp_san_2,
> but that row chanced to have a null value of id.
>
> Possibly the planner could be smarter about estimating for this case,
> but it doesn't look much like a typical use-case to me.
>
>             regards, tom lane

Thanks a lot! Right, my problem had nothing to do with the type of the
join field, but with selecting the proper field for the IS NULL
condition.
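
For reference, the two tables in my tests look roughly like this (a minimal sketch reconstructed from the queries in this thread; the exact INSERT statements and the generate_series trick are my assumption, but the names, columns, and row counts match):

```sql
-- Hypothetical reconstruction of the test setup:
-- tmp_san_1 holds 10,000 rows; tmp_san_2 holds a single matching row.
CREATE TEMP TABLE tmp_san_1 (id integer, text text);
CREATE TEMP TABLE tmp_san_2 (id integer, text text);

INSERT INTO tmp_san_1 (id, text)
    SELECT i, 'text_' || i FROM generate_series(1, 10000) AS i;
INSERT INTO tmp_san_2 (id, text) VALUES (1, 'text_1');

-- Refresh planner statistics so the estimates below are reproducible.
ANALYZE tmp_san_1;
ANALYZE tmp_san_2;
```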

So, even a join on the id field is badly estimated if checked on the
text field:

=# EXPLAIN ANALYZE
      SELECT tmp_san_1.id
      FROM tmp_san_1
        LEFT JOIN tmp_san_2 ON tmp_san_1.id = tmp_san_2.id
      WHERE (tmp_san_2.text IS NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Hash Left Join  (cost=1.02..192.53 rows=1 width=4) (actual time=0.019..2.939 rows=9999 loops=1)
    Hash Cond: (tmp_san_1.id = tmp_san_2.id)
    Filter: (tmp_san_2.text IS NULL)
    Rows Removed by Filter: 1
    ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=4) (actual time=0.007..1.003 rows=10000 loops=1)
    ->  Hash  (cost=1.01..1.01 rows=1 width=6) (actual time=0.004..0.004 rows=1 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 9kB
          ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=6) (actual time=0.001..0.002 rows=1 loops=1)
  Planning time: 0.062 ms
  Execution time: 3.381 ms
(10 rows)


... but if the join and the check refer to the same field, everything is
fine:

=# EXPLAIN ANALYZE
      SELECT tmp_san_1.id
      FROM tmp_san_1
        LEFT JOIN tmp_san_2 ON tmp_san_1.id = tmp_san_2.id
      WHERE (tmp_san_2.id IS NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Hash Anti Join  (cost=1.02..281.26 rows=9999 width=4) (actual time=0.018..2.672 rows=9999 loops=1)
    Hash Cond: (tmp_san_1.id = tmp_san_2.id)
    ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=4) (actual time=0.007..0.962 rows=10000 loops=1)
    ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 9kB
          ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
  Planning time: 0.051 ms
  Execution time: 3.164 ms
(8 rows)


It gets more interesting again if the text field really could be NULL
and I wanted to include those rows. If I just add "OR tmp_san_2.text
IS NULL", the estimates are off again:

=# EXPLAIN ANALYZE
      SELECT tmp_san_1.id
      FROM tmp_san_1
        LEFT JOIN tmp_san_2 ON tmp_san_1.id = tmp_san_2.id
      WHERE (tmp_san_2.id IS NULL OR tmp_san_2.text IS NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Hash Left Join  (cost=1.02..192.53 rows=1 width=4) (actual time=0.019..2.984 rows=9999 loops=1)
    Hash Cond: (tmp_san_1.id = tmp_san_2.id)
    Filter: ((tmp_san_2.id IS NULL) OR (tmp_san_2.text IS NULL))
    Rows Removed by Filter: 1
    ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=4) (actual time=0.008..1.024 rows=10000 loops=1)
    ->  Hash  (cost=1.01..1.01 rows=1 width=6) (actual time=0.004..0.004 rows=1 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 9kB
          ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=6) (actual time=0.001..0.002 rows=1 loops=1)
  Planning time: 0.088 ms
  Execution time: 3.508 ms
(10 rows)


Instead, it seems I have to move this condition (inverted) into the
join clause for the planner to produce correct estimates again:

=# EXPLAIN ANALYZE
      SELECT tmp_san_1.id
      FROM tmp_san_1
        LEFT JOIN tmp_san_2 ON tmp_san_1.id = tmp_san_2.id
                           AND tmp_san_2.text IS NOT NULL
      WHERE (tmp_san_2.id IS NULL);
                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Hash Anti Join  (cost=1.02..281.26 rows=9999 width=4) (actual time=0.017..2.761 rows=9999 loops=1)
    Hash Cond: (tmp_san_1.id = tmp_san_2.id)
    ->  Seq Scan on tmp_san_1  (cost=0.00..154.00 rows=10000 width=4) (actual time=0.007..1.052 rows=10000 loops=1)
    ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
          Buckets: 1024  Batches: 1  Memory Usage: 9kB
          ->  Seq Scan on tmp_san_2  (cost=0.00..1.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=1)
                Filter: (text IS NOT NULL)
  Planning time: 0.058 ms
  Execution time: 3.232 ms
(9 rows)


So, yes, the planner could infer a bit more here: after all, if only a
few rows are present to start with, only a few rows can meet any
condition. But that may well be an unusual case. It's just easy to get
puzzled by these things once you get used to the PostgreSQL planner
being very smart in most cases ;-)
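
For what it's worth, rewriting the check as NOT EXISTS sidesteps the question of which column to test for IS NULL entirely, since the planner recognizes that form as an anti-join directly (a sketch against the same tables; I have not re-run EXPLAIN ANALYZE for this variant, so treat the claim about the plan as an assumption):

```sql
-- NOT EXISTS states the anti-join intent explicitly, so no
-- outer-join column has to be singled out for an IS NULL test:
SELECT tmp_san_1.id
FROM tmp_san_1
WHERE NOT EXISTS (
    SELECT 1
    FROM tmp_san_2
    WHERE tmp_san_2.id = tmp_san_1.id
      AND tmp_san_2.text IS NOT NULL
);
```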

Thanks again,
Stefan


