Thread: IS NOT DISTINCT FROM + Indexing

IS NOT DISTINCT FROM + Indexing

From

"Jonathan S. Katz"

Date:

21 July 2014, 23:16:51

Hi,

I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an indexable operation in a B-tree index, as it is
effectivelytesting for equality albeit with some "magic" for NULLs?  Here is an example of what I mean, running tests
on9.3.4: 
-- create a table of integersCREATE TABLE numbers ASSELECT x FROM generate_series(1,1000000) x;
-- create a b-tree indexCREATE INDEX numbers_x_idx ON numbers (x);
-- find x = 500SELECT * FROM numbers WHERE x = 500;  x  ----- 500(1 row)
-- query planEXPLAIN SELECT * FROM numbers WHERE x = 500;                                       QUERY PLAN
                     ---------------------------------------------------------------------------------- Index Only Scan
usingnumbers_x_idx on numbers  (cost=0.42..8.44 rows=1 width=4)   Index Cond: (x = 500)(2 rows) 

-- now find x IS NOT DISTINCT FROM 500SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500;  x  ----- 500(1 row)
-- but the query plan is...EXPLAIN SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500;                        QUERY
PLAN                        ----------------------------------------------------------- Seq Scan on numbers
(cost=0.00..16925.00rows=1 width=4)   Filter: (NOT (x IS DISTINCT FROM 500)) 

With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index?

Thanks,

Jonathan

Re: IS NOT DISTINCT FROM + Indexing

From

Peter Geoghegan

Date:

21 July 2014, 23:51:42

On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz
<jonathan.katz@excoventures.com> wrote:
> With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the
index?

FWIW this works:

postgres=# explain analyze select * from orders where orderid in (5, null);
       QUERY PLAN
 

----------------------------------------------------------------------------------------------------------------------Index
Scanusing orders_pkey on orders  (cost=0.29..12.60 rows=1
 
width=60) (actual time=0.019..0.021 rows=1 loops=1)  Index Cond: (orderid = ANY ('{5,NULL}'::integer[]))Planning time:
0.100msExecution time: 0.416 ms
 
(4 rows)

I think that it would almost be a Simple Matter of Programming to make
IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't
very different to using the equality operator:

/** DistinctExpr - expression node for "x IS DISTINCT FROM y"** Except for the nodetag, this is represented identically
toan OpExpr* referencing the "=" operator for x and y.* We use "=", not the more obvious "<>", because more datatypes
have"="* than "<>".  This means the executor must invert the operator result.* Note that the operator function won't be
calledat all if either input* is NULL, since then the result can be determined directly.*/
 
typedef OpExpr DistinctExpr;

We're already inverting the equals operator. But that isn't
necessarily how a B-Tree index represents equality (that is, a
particular B-Tree operator class could have a non-'=' operator that it
thinks of as equality-ish - in general that could even be the default
B-Tree opclass and there may not be an equals operator). The fact that
most types think of the '=' equals operator as equality is just a
convention, and so technically IS DISTINCT FROM doesn't invert B-Tree
operation 3. See "31.14. Interfacing Extensions To Indexes" for
details. The equals operator '=' isn't really supposed to be magic, it
just is in some places.

Right now the executor is directly inverting the equality operator to
make this work (and has done so since long before NULLs were
indexable). This is a bit of a kludge. I guess it just works that way
because there is no convenient place to insert the special inversion
of the operator, and the special NULL handling that currently appears
within ExecEvalDistinct().

-- 
Peter Geoghegan

Re: IS NOT DISTINCT FROM + Indexing

From

Andres Freund

Date:

21 July 2014, 23:57:20

On 2014-07-21 16:51:32 -0700, Peter Geoghegan wrote:
> On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz
> <jonathan.katz@excoventures.com> wrote:
> > With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the
index?
> 
> FWIW this works:
> 
> postgres=# explain analyze select * from orders where orderid in (5, null);

I rather doubt it will. x in (y1, ... yn) is essentially expanded to x =
y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using
normal equality comparison and thus not return a row with a NULL
orderid. Am I missing something?

> I think that it would almost be a Simple Matter of Programming to make
> IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't
> very different to using the equality operator:

But yea, it probably wouldn't take very much for that.

Greetings,

Andres Freund

Re: IS NOT DISTINCT FROM + Indexing

From

Peter Geoghegan

Date:

22 July 2014, 00:02:31

On Mon, Jul 21, 2014 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:
> I rather doubt it will. x in (y1, ... yn) is essentially expanded to x =
> y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using
> normal equality comparison and thus not return a row with a NULL
> orderid. Am I missing something?

I was a little bit incautious in my use of words. The point is that a
scanKey could easily represent a ScalarArrayOpExpr with NULLs and
non-NULLs.
-- 
Peter Geoghegan

Re: IS NOT DISTINCT FROM + Indexing

From

Tom Lane

Date:

22 July 2014, 01:51:17

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:
> I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an
> indexable operation in a B-tree index,

The short reason why not is that it's not an operator (where "operator"
is defined as "something with a pg_operator entry"), and all our indexing
infrastructure is built around the notion that indexable clauses are of
the form "indexed_column indexable_operator comparison_value".

You could certainly imagine ways to fix that, but nobody's put in the
probably-nontrivial effort required to do so.  The btree code itself
would likely be the easiest part to fix, as it sort of thinks nulls
are real values already.
        regards, tom lane

Re: IS NOT DISTINCT FROM + Indexing

From

"Jonathan S. Katz"

Date:

22 July 2014, 04:07:20

On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:
>> I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an
>> indexable operation in a B-tree index,
>
> The short reason why not is that it's not an operator (where "operator"
> is defined as "something with a pg_operator entry"), and all our indexing
> infrastructure is built around the notion that indexable clauses are of
> the form "indexed_column indexable_operator comparison_value".
>
> You could certainly imagine ways to fix that, but nobody's put in the
> probably-nontrivial effort required to do so.  The btree code itself
> would likely be the easiest part to fix, as it sort of thinks nulls
> are real values already.

What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS
NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on
thesurface.  It sounds like the IS NULL work is in the btree code? 

Even if it is trivial, it would be tough for me personally to hack on without some hand-holding.  I did want to ask
aboutit because it can be useful in simplifying some queries when you have do deal with NULLs (though in reality I tend
touse IS DISTINCT FROM much more, though in things like triggers) and would be useful with exclusion constraints
(thoughwith those it sounds like it would have to be an operator?). 

If it is a small project someone is interested, I would be happy to contribute by testing.

Jonathan

Re: IS NOT DISTINCT FROM + Indexing

From

Tom Lane

Date:

22 July 2014, 04:40:33

"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:
> On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The short reason why not is that it's not an operator (where "operator"
>> is defined as "something with a pg_operator entry"), and all our indexing
>> infrastructure is built around the notion that indexable clauses are of
>> the form "indexed_column indexable_operator comparison_value".

> What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS
NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on
thesurface.  It sounds like the IS NULL work is in the btree code?
 

We hacked in IS [NOT] NULL as a potentially indexable construct, but the
key thing that made that possible without major redesign is that IS [NOT]
NULL is datatype independent, so there's no need to identify any
particular underlying operator or opclass.  I'm not sure what we'd do to
handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna
cut it.

Another point is that people are unlikely to be satisfied with planner
optimization for IS NOT DISTINCT FROM that doesn't support it as a join
clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue
that doesn't arise for IS [NOT] NULL, as it has only one argument.  So
that brings you into not just indexability but hashing and merging
support.  I hasten to say that that doesn't necessarily have to happen
in a version-zero patch; but trying to make IS NOT DISTINCT FROM into
a first-class construct is a big project.
        regards, tom lane

Re: IS NOT DISTINCT FROM + Indexing

From

"Jonathan S. Katz"

Date:

23 July 2014, 02:01:06

On Jul 22, 2014, at 12:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Jonathan S. Katz" <jonathan.katz@excoventures.com> writes:
>> On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The short reason why not is that it's not an operator (where "operator"
>>> is defined as "something with a pg_operator entry"), and all our indexing
>>> infrastructure is built around the notion that indexable clauses are of
>>> the form "indexed_column indexable_operator comparison_value".
>
>> What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS
NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on
thesurface.  It sounds like the IS NULL work is in the btree code? 
>
> We hacked in IS [NOT] NULL as a potentially indexable construct, but the
> key thing that made that possible without major redesign is that IS [NOT]
> NULL is datatype independent, so there's no need to identify any
> particular underlying operator or opclass.  I'm not sure what we'd do to
> handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna
> cut it.
>
> Another point is that people are unlikely to be satisfied with planner
> optimization for IS NOT DISTINCT FROM that doesn't support it as a join
> clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue
> that doesn't arise for IS [NOT] NULL, as it has only one argument.  So
> that brings you into not just indexability but hashing and merging
> support.  I hasten to say that that doesn't necessarily have to happen
> in a version-zero patch; but trying to make IS NOT DISTINCT FROM into
> a first-class construct is a big project.

Well that definitely answers "how hard would it be." - before embarking on something laborious (as even just indexing
isnontrivial), I think it would be good to figure out how people are using IS [NOT] DISTINCT FROM and if there is
interestin having it be indexable, let alone used in a JOIN optimization.  It could become a handy tool to simplify the
SQLin queries that are returning a lot of NULL / NOT NULL data mixed together. 

Jonathan

Re: IS NOT DISTINCT FROM + Indexing

From

Alvaro Herrera

Date:

23 July 2014, 17:30:49

Jonathan S. Katz wrote:

> Well that definitely answers "how hard would it be." - before
> embarking on something laborious (as even just indexing is
> nontrivial), I think it would be good to figure out how people are
> using IS [NOT] DISTINCT FROM and if there is interest in having it be
> indexable, let alone used in a JOIN optimization.  It could become a
> handy tool to simplify the SQL in queries that are returning a lot of
> NULL / NOT NULL data mixed together.

FWIW my project (abandoned for now, but I intend to get back to it
someday) to catalog NOT NULL constraints in pg_constraint had me
rewriting them into some kind of IS DISTINCT FROM NULL expression.  (It
was IS NOT NULL initially until somebody pointed out that that wouldn't
work for composite type columns).  I'm not sure how that interacts with
what you're doing here, maybe not at all; I thought I'd mention it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: IS NOT DISTINCT FROM + Indexing

From

Kevin Grittner

Date:

23 July 2014, 17:34:05

Jonathan S. Katz <jonathan.katz@excoventures.com> wrote:

> before embarking on something laborious (as even just indexing
> is nontrivial), I think it would be good to figure out how people
> are using IS [NOT] DISTINCT FROM and if there is interest in
> having it be indexable, let alone used in a JOIN optimization.
> It could become a handy tool to simplify the SQL in queries that
> are returning a lot of NULL / NOT NULL data mixed together.

To prevent subtle inconsistencies, I think we would need to limit
support to data types with a btree opclass which uses "=" as the
equality operator on indexes using that opclass (either by default
or explicitly).  That limitation would still allow a lot of useful
cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company