Should the optimizer see this? - Mailing list pgsql-performance
From | Ron Mayer |
---|---|
Subject | Should the optimizer see this? |
Date | |
Msg-id | Pine.LNX.4.58.0501261636010.9564@greenie.cheapcomplexdevices.com Whole thread Raw |
Responses |
Re: Should the optimizer see this?
|
List | pgsql-performance |
Short summary... the second query runs faster, and I think they should be identical queries. Should the optimizer have found this optimization? I have two identical (or so I believe) queries; one where I explicitly add a "is not null" comparison; and one where I think it would implicitly only find not-null columns. The queries are select * from rt4, rt5 where rt4.tigerfile = rt5.tigerfile and feat = feat3; and select * from (select * from rt4 where feat3 is not null) as rt4, rt5 where rt4.tigerfile = rt5.tigerfile and feat = feat3; I would have thought that the optimizer would see that if feat3 is null (which it usually is), it doesn't need to keep those rows and sort them -- but it seems (looking both at explain analyze and "du" on the tmp directory) that in the first query it is indeed sorting all the rows --- even the ones with feat3=null. The tables are the Census Tiger Line data explained in detail here: http://www.census.gov/geo/www/tiger/tiger2003/TGR2003.pdf I can attach the create statemnts for the tables if people think they'd help. Basically, table rt4 has a column called feat3 which is usually null, and table rt5 has a column called feat which is never null. Both tables have a few million rows. No indexes were used, since I'm joining everything to everything, they shouldn't have helped anyway. However vacuum analyze was run, and (as seen in the second query) the stats did know that the column feat3 was mostly null. ===================================================================================================== fli=# fli=# explain analyze select * from rt4, rt5 where rt4.tigerfile = rt5.tigerfile and feat = feat3; fli-# fli-# fli-# fli-# QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=1922903.02..1967385.35 rows=117698 width=100) (actual time=179246.872..218920.724 rows=153091 loops=1) Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile)) -> Sort (cost=876532.10..888964.80 rows=4973079 width=45) (actual time=57213.327..67313.216 rows=4971022 loops=1) Sort Key: rt4.feat3, rt4.tigerfile -> Seq Scan on rt4 (cost=0.00..94198.79 rows=4973079 width=45) (actual time=0.053..10433.883 rows=4971022 loops=1) -> Sort (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=122033.463..134037.127 rows=5767675 loops=1) Sort Key: rt5.feat, rt5.tigerfile -> Seq Scan on rt5 (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.016..22538.958 rows=5635077 loops=1) Total runtime: 219632.580 ms (9 rows) fli=# fli=# fli=# fli=# explain analyze select * from (select * from rt4 where feat3 is not null) as rt4, rt5 where rt4.tigerfile = rt5.tigerfile and feat = feat3; fli-# fli-# fli-# fli-# QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=1152466.47..1194789.77 rows=3296 width=100) (actual time=125982.562..145927.220 rows=153091 loops=1) Merge Cond: (("outer".feat3 = "inner".feat) AND ("outer".tigerfile = "inner".tigerfile)) -> Sort (cost=106095.56..106443.67 rows=139247 width=45) (actual time=11729.319..11823.006 rows=153091 loops=1) Sort Key: tgr.rt4.feat3, tgr.rt4.tigerfile -> Seq Scan on rt4 (cost=0.00..94198.79 rows=139247 width=45) (actual time=32.404..10893.373 rows=153091 loops=1) Filter: (feat3 IS NOT NULL) -> Sort (cost=1046370.92..1060457.95 rows=5634813 width=55) (actual time=114253.157..126650.225 rows=5767675 loops=1) Sort Key: rt5.feat, rt5.tigerfile -> Seq Scan on rt5 (cost=0.00..127146.13 rows=5634813 width=55) (actual time=0.012..19253.431 rows=5635077 loops=1) Total runtime: 146480.294 ms (10 rows) fli=# fli=# fli=#
pgsql-performance by date: