Thread: IS NOT DISTINCT FROM + Indexing
Hi, I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an indexable operation in a B-tree index, as it is effectivelytesting for equality albeit with some "magic" for NULLs? Here is an example of what I mean, running tests on9.3.4: -- create a table of integersCREATE TABLE numbers ASSELECT x FROM generate_series(1,1000000) x; -- create a b-tree indexCREATE INDEX numbers_x_idx ON numbers (x); -- find x = 500SELECT * FROM numbers WHERE x = 500; x ----- 500(1 row) -- query planEXPLAIN SELECT * FROM numbers WHERE x = 500; QUERY PLAN ---------------------------------------------------------------------------------- Index Only Scan usingnumbers_x_idx on numbers (cost=0.42..8.44 rows=1 width=4) Index Cond: (x = 500)(2 rows) -- now find x IS NOT DISTINCT FROM 500SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500; x ----- 500(1 row) -- but the query plan is...EXPLAIN SELECT * FROM numbers WHERE x IS NOT DISTINCT FROM 500; QUERY PLAN ----------------------------------------------------------- Seq Scan on numbers (cost=0.00..16925.00rows=1 width=4) Filter: (NOT (x IS DISTINCT FROM 500)) With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index? Thanks, Jonathan
On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz <jonathan.katz@excoventures.com> wrote: > With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index? FWIW this works: postgres=# explain analyze select * from orders where orderid in (5, null); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------Index Scanusing orders_pkey on orders (cost=0.29..12.60 rows=1 width=60) (actual time=0.019..0.021 rows=1 loops=1) Index Cond: (orderid = ANY ('{5,NULL}'::integer[]))Planning time: 0.100msExecution time: 0.416 ms (4 rows) I think that it would almost be a Simple Matter of Programming to make IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't very different to using the equality operator: /** DistinctExpr - expression node for "x IS DISTINCT FROM y"** Except for the nodetag, this is represented identically toan OpExpr* referencing the "=" operator for x and y.* We use "=", not the more obvious "<>", because more datatypes have"="* than "<>". This means the executor must invert the operator result.* Note that the operator function won't be calledat all if either input* is NULL, since then the result can be determined directly.*/ typedef OpExpr DistinctExpr; We're already inverting the equals operator. But that isn't necessarily how a B-Tree index represents equality (that is, a particular B-Tree operator class could have a non-'=' operator that it thinks of as equality-ish - in general that could even be the default B-Tree opclass and there may not be an equals operator). The fact that most types think of the '=' equals operator as equality is just a convention, and so technically IS DISTINCT FROM doesn't invert B-Tree operation 3. See "31.14. Interfacing Extensions To Indexes" for details. The equals operator '=' isn't really supposed to be magic, it just is in some places. Right now the executor is directly inverting the equality operator to make this work (and has done so since long before NULLs were indexable). This is a bit of a kludge. I guess it just works that way because there is no convenient place to insert the special inversion of the operator, and the special NULL handling that currently appears within ExecEvalDistinct(). -- Peter Geoghegan
On 2014-07-21 16:51:32 -0700, Peter Geoghegan wrote: > On Mon, Jul 21, 2014 at 4:16 PM, Jonathan S. Katz > <jonathan.katz@excoventures.com> wrote: > > With NULLs being indexable, I was wondering if there was some reason why IS NOT DISTINCT FROM could not use the index? > > FWIW this works: > > postgres=# explain analyze select * from orders where orderid in (5, null); I rather doubt it will. x in (y1, ... yn) is essentially expanded to x = y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using normal equality comparison and thus not return a row with a NULL orderid. Am I missing something? > I think that it would almost be a Simple Matter of Programming to make > IS NOT DISTINCT FROM indexable. Under the hood, IS DISTINCT FROM isn't > very different to using the equality operator: But yea, it probably wouldn't take very much for that. Greetings, Andres Freund
On Mon, Jul 21, 2014 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote: > I rather doubt it will. x in (y1, ... yn) is essentially expanded to x = > y1 OR x = y2, ... OR x = yn. I.e. the NULL comparison will be done using > normal equality comparison and thus not return a row with a NULL > orderid. Am I missing something? I was a little bit incautious in my use of words. The point is that a scanKey could easily represent a ScalarArrayOpExpr with NULLs and non-NULLs. -- Peter Geoghegan
"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes: > I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an > indexable operation in a B-tree index, The short reason why not is that it's not an operator (where "operator" is defined as "something with a pg_operator entry"), and all our indexing infrastructure is built around the notion that indexable clauses are of the form "indexed_column indexable_operator comparison_value". You could certainly imagine ways to fix that, but nobody's put in the probably-nontrivial effort required to do so. The btree code itself would likely be the easiest part to fix, as it sort of thinks nulls are real values already. regards, tom lane
On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonathan S. Katz" <jonathan.katz@excoventures.com> writes: >> I'm curious if there is a reason why "IS NOT DISTINCT FROM" is not an >> indexable operation in a B-tree index, > > The short reason why not is that it's not an operator (where "operator" > is defined as "something with a pg_operator entry"), and all our indexing > infrastructure is built around the notion that indexable clauses are of > the form "indexed_column indexable_operator comparison_value". > > You could certainly imagine ways to fix that, but nobody's put in the > probably-nontrivial effort required to do so. The btree code itself > would likely be the easiest part to fix, as it sort of thinks nulls > are real values already. What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on thesurface. It sounds like the IS NULL work is in the btree code? Even if it is trivial, it would be tough for me personally to hack on without some hand-holding. I did want to ask aboutit because it can be useful in simplifying some queries when you have do deal with NULLs (though in reality I tend touse IS DISTINCT FROM much more, though in things like triggers) and would be useful with exclusion constraints (thoughwith those it sounds like it would have to be an operator?). If it is a small project someone is interested, I would be happy to contribute by testing. Jonathan
"Jonathan S. Katz" <jonathan.katz@excoventures.com> writes: > On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The short reason why not is that it's not an operator (where "operator" >> is defined as "something with a pg_operator entry"), and all our indexing >> infrastructure is built around the notion that indexable clauses are of >> the form "indexed_column indexable_operator comparison_value". > What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on thesurface. It sounds like the IS NULL work is in the btree code? We hacked in IS [NOT] NULL as a potentially indexable construct, but the key thing that made that possible without major redesign is that IS [NOT] NULL is datatype independent, so there's no need to identify any particular underlying operator or opclass. I'm not sure what we'd do to handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna cut it. Another point is that people are unlikely to be satisfied with planner optimization for IS NOT DISTINCT FROM that doesn't support it as a join clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue that doesn't arise for IS [NOT] NULL, as it has only one argument. So that brings you into not just indexability but hashing and merging support. I hasten to say that that doesn't necessarily have to happen in a version-zero patch; but trying to make IS NOT DISTINCT FROM into a first-class construct is a big project. regards, tom lane
On Jul 22, 2014, at 12:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Jonathan S. Katz" <jonathan.katz@excoventures.com> writes: >> On Jul 21, 2014, at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> The short reason why not is that it's not an operator (where "operator" >>> is defined as "something with a pg_operator entry"), and all our indexing >>> infrastructure is built around the notion that indexable clauses are of >>> the form "indexed_column indexable_operator comparison_value". > >> What got me thinking this initially problem is that I know "IS NULL" is indexable and I was unsure of how adding "IS NOTDISTINCT FROM" would be too different from that - of course, this is from my perspective from primarily operating on thesurface. It sounds like the IS NULL work is in the btree code? > > We hacked in IS [NOT] NULL as a potentially indexable construct, but the > key thing that made that possible without major redesign is that IS [NOT] > NULL is datatype independent, so there's no need to identify any > particular underlying operator or opclass. I'm not sure what we'd do to > handle IS [NOT] DISTINCT FROM, but that particular approach ain't gonna > cut it. > > Another point is that people are unlikely to be satisfied with planner > optimization for IS NOT DISTINCT FROM that doesn't support it as a join > clause (i.e., tab1.col1 IS NOT DISTINCT FROM tab2.col2); which is an issue > that doesn't arise for IS [NOT] NULL, as it has only one argument. So > that brings you into not just indexability but hashing and merging > support. I hasten to say that that doesn't necessarily have to happen > in a version-zero patch; but trying to make IS NOT DISTINCT FROM into > a first-class construct is a big project. Well that definitely answers "how hard would it be." - before embarking on something laborious (as even just indexing isnontrivial), I think it would be good to figure out how people are using IS [NOT] DISTINCT FROM and if there is interestin having it be indexable, let alone used in a JOIN optimization. It could become a handy tool to simplify the SQLin queries that are returning a lot of NULL / NOT NULL data mixed together. Jonathan
Jonathan S. Katz wrote: > Well that definitely answers "how hard would it be." - before > embarking on something laborious (as even just indexing is > nontrivial), I think it would be good to figure out how people are > using IS [NOT] DISTINCT FROM and if there is interest in having it be > indexable, let alone used in a JOIN optimization. It could become a > handy tool to simplify the SQL in queries that are returning a lot of > NULL / NOT NULL data mixed together. FWIW my project (abandoned for now, but I intend to get back to it someday) to catalog NOT NULL constraints in pg_constraint had me rewriting them into some kind of IS DISTINCT FROM NULL expression. (It was IS NOT NULL initially until somebody pointed out that that wouldn't work for composite type columns). I'm not sure how that interacts with what you're doing here, maybe not at all; I thought I'd mention it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Jonathan S. Katz <jonathan.katz@excoventures.com> wrote: > before embarking on something laborious (as even just indexing > is nontrivial), I think it would be good to figure out how people > are using IS [NOT] DISTINCT FROM and if there is interest in > having it be indexable, let alone used in a JOIN optimization. > It could become a handy tool to simplify the SQL in queries that > are returning a lot of NULL / NOT NULL data mixed together. To prevent subtle inconsistencies, I think we would need to limit support to data types with a btree opclass which uses "=" as the equality operator on indexes using that opclass (either by default or explicitly). That limitation would still allow a lot of useful cases. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company