Thread: Using an index for IS DISTINCT FROM queries

Using an index for IS DISTINCT FROM queries

From
Steven Schlansker
Date:
Hi everyone,
I have a large table (~150M rows) that keeps a version field.  At any given time, it is expected that the vast majority
ofthe rows are on the "current" version, but some may be missing. 

To figure out all the missing our outdated values, I run a query along the lines of

SELECT id FROM source_table LEFT OUTER JOIN dest_table WHERE version IS DISTINCT FROM <current-version>

However, this query always selects a sequential scan and hash of both tables, which is *very* slow.

The statistics reflect that the planner knows that current-version is overwhelmingly common:

 schemaname | tablename |     attname      | inherited | null_frac | avg_width | n_distinct | most_common_vals |
most_common_freqs| histogram_bounds | correlation 

------------+-----------+------------------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 public     | event     | unpacker_version | f         |         0 |         4 |          1 | {1}              | {1}
          |                  |           1 

but it doesn't help:

event=> explain select count(1) from event where unpacker_version is distinct from 1;
                            QUERY PLAN
------------------------------------------------------------------
 Aggregate  (cost=10658841.93..10658841.94 rows=1 width=0)
   ->  Seq Scan on event  (cost=0.00..10658841.93 rows=1 width=0)
         Filter: (unpacker_version IS DISTINCT FROM 1)
(3 rows)


I can "by hand" force the planner to consider the obvious solution (find values below, above, and null) but it is ugly:

event=> explain select count(1) from event where unpacker_version < 1 or unpacker_version > 1 or unpacker_version is
null;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Aggregate  (cost=139.14..139.15 rows=1 width=0)
   ->  Bitmap Heap Scan on event  (cost=135.13..139.14 rows=1 width=0)
         Recheck Cond: ((unpacker_version < 1) OR (unpacker_version > 1) OR (unpacker_version IS NULL))
         ->  BitmapOr  (cost=135.13..135.13 rows=1 width=0)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version < 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version > 1)
               ->  Bitmap Index Scan on event_unpacker_version_idx  (cost=0.00..45.04 rows=1 width=0)
                     Index Cond: (unpacker_version IS NULL)
(10 rows)

but this sucks to do such tuning by munging the query.  Is there some case where these are not equivalent?  If they are
equivalent,would this be a reasonable case for the planner to consider and optimize by itself? 

Thanks,
Steven