Thread: Should the optimizer see this?

Should the optimizer see this?

From
Ron Mayer
Date:
Short summary... the second query runs faster, and I think
they should be identical queries.   Should the optimizer
have found this optimization?


I have two identical (or so I believe) queries; one where I
explicitly add a "is not null" comparison; and one where I
think it would implicitly only find not-null columns.

The queries are

    select *
      from rt4, rt5
     where rt4.tigerfile = rt5.tigerfile
       and feat = feat3;

and

    select *
      from (select * from rt4 where feat3 is not null) as rt4, rt5
     where rt4.tigerfile = rt5.tigerfile
       and feat = feat3;

I would have thought that the optimizer would see that
if feat3 is null (which it usually is), it doesn't need
to keep those rows and sort them -- but it seems (looking
both at explain analyze and "du" on the tmp directory)
that in the first query it is indeed sorting all the
rows --- even the ones with feat3=null.



The tables are the Census Tiger Line data explained in detail here:
    http://www.census.gov/geo/www/tiger/tiger2003/TGR2003.pdf
I can attach the create statemnts for the tables if people
think they'd help.   Basically, table rt4 has a column
called feat3 which is usually null, and table rt5 has a
column called feat which is never null.  Both tables have
a few million rows.

No indexes were used, since I'm joining everything to
everything, they shouldn't have helped anyway.  However
vacuum analyze was run, and (as seen in the second query)
the stats did know that the column feat3 was mostly null.

=====================================================================================================
fli=#
fli=# explain analyze
                    
    select *
                    
      from rt4, rt5
                    
     where rt4.tigerfile = rt5.tigerfile
                    
       and feat = feat3;
fli-# fli-# fli-# fli-#

                                                          QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1922903.02..1967385.35 rows=117698 width=100) (actual time=179246.872..218920.724 rows=153091
loops=1)
   Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile))
   ->  Sort  (cost=876532.10..888964.80 rows=4973079 width=45) (actual time=57213.327..67313.216 rows=4971022 loops=1)
         Sort Key: rt4.feat3, rt4.tigerfile
         ->  Seq Scan on rt4  (cost=0.00..94198.79 rows=4973079 width=45) (actual time=0.053..10433.883 rows=4971022
loops=1)
   ->  Sort  (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=122033.463..134037.127 rows=5767675
loops=1)
         Sort Key: rt5.feat, rt5.tigerfile
         ->  Seq Scan on rt5  (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.016..22538.958 rows=5635077
loops=1)
 Total runtime: 219632.580 ms
(9 rows)

fli=# fli=# fli=#
fli=# explain analyze
                    
    select *
                    
      from (select * from rt4 where feat3 is not null) as rt4, rt5
                    
     where rt4.tigerfile = rt5.tigerfile
                    
       and feat = feat3;
                    

fli-# fli-# fli-# fli-#                                                           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1152466.47..1194789.77 rows=3296 width=100) (actual time=125982.562..145927.220 rows=153091 loops=1)
   Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile))
   ->  Sort  (cost=106095.56..106443.67 rows=139247 width=45) (actual time=11729.319..11823.006 rows=153091 loops=1)
         Sort Key: tgr.rt4.feat3, tgr.rt4.tigerfile
         ->  Seq Scan on rt4  (cost=0.00..94198.79 rows=139247 width=45) (actual time=32.404..10893.373 rows=153091
loops=1)
               Filter: (feat3 IS NOT NULL)
   ->  Sort  (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=114253.157..126650.225 rows=5767675
loops=1)
         Sort Key: rt5.feat, rt5.tigerfile
         ->  Seq Scan on rt5  (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.012..19253.431 rows=5635077
loops=1)
 Total runtime: 146480.294 ms
(10 rows)

fli=# fli=#
fli=#


Re: Should the optimizer see this?

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Should the optimizer have found this optimization?

I can't get excited about it.  Joining on a column that's mostly nulls
doesn't seem like a common thing to do.

            regards, tom lane