Thread: Suggestions for implementing IS DISTINCT FROM?

Suggestions for implementing IS DISTINCT FROM?

From
Thomas Lockhart
Date:
I'm looking at implementing IS DISTINCT FROM, among other things. It has
the unusual behavior that it compares elements for a tuple by
considering two NULLs to be equal (hence non-distinct) rather than
"unknown". So the rules for comparison seem to be:

a) if the rows compared have different lengths, they are distinct
b) if the rows are both zero-length, they are not distinct
c) otherwise, each element in the row (or a single value on each side of
the comparison) are compared pairwise, with1) if both elements are NULL, they are not distinct2) if one element of the
pairis NULL, they are distinct3) if both elements are NOT NULL and are equal, they are not distinct4) if no pair of
elementsis distinct, the rows are not distinct5) otherwise, the rows are distinct
 

I was thinking to implement this by simply expanding these rules within
gram.y to be a tree of comparison tests. But I think that this does not
generalize properly into allowing tuples or rows to be supplied by
subqueries or other non-literal tuples. So, I'm looking for suggestions
on how to go about implementing this. Should I define a new comparison
node like the AND expression which can directly handle the NULL
behaviors correctly? That would require at least minor changes in the
optimizer and executor. Does another approach come to mind (esp. one
which works ;)?

TIA
                   - Thomas


Re: Suggestions for implementing IS DISTINCT FROM?

From
Tom Lane
Date:
Thomas Lockhart <lockhart@fourpalms.org> writes:
> I'm looking at implementing IS DISTINCT FROM, among other things.
> ...
> I was thinking to implement this by simply expanding these rules within
> gram.y to be a tree of comparison tests.

Please, please, do not do that.  Make a new expression node tree type,
instead.  We've made this mistake before (eg for BETWEEN) and I don't
want to do it again.

Aside from the points you make, a direct expansion approach cannot
reverse-list properly in rules/views, and it will force multiple
evaluations of arguments that should not be multiply evaluated.

Adding a new expression node tree type is not too difficult these days;
see for example Joe Conway's recent NullTest and BooleanTest additions.

I believe the existing expansions of row comparison operators
(makeRowExpr) should be replaced by specialized nodes, too.  That would
give us a shot at implementing row '<', '>' comparisons in a
spec-compliant fashion...
        regards, tom lane


Re: Suggestions for implementing IS DISTINCT FROM?

From
Thomas Lockhart
Date:
> I've actually already done almost all the work for converting BETWEEN to a
> node but I have a couple of questions:
> Should I use a boolean in the node to indicate whether it is SYMMETRIC or
> ASYMMETRIC, or should I use some sort of integer to indicate whether it is
> SYMMETRIC, ASYMMETRIC or DEFAULT (ASYMMETRIC).  That way the reverse in
> rules and views could leave out the ASYMMETRIC if it wasn't specified
> originally, rather than always adding it in.  Which is better?

Great! 

I would use a boolean (or integer) to indicate two possibilities, not
three.

The language specifies what the default should be, and dump programs
could choose to omit the ASYMMETRIC if they choose. imho it is best to
resolve defaults earlier, rather than pushing the resolution deep into
the parser or even farther.
                  - Thomas




Re: Suggestions for implementing IS DISTINCT FROM?

From
Thomas Lockhart
Date:
> > I'm looking at implementing IS DISTINCT FROM, among other things.
> > ...
> > I was thinking to implement this by simply expanding these rules within
> > gram.y to be a tree of comparison tests.
> Please, please, do not do that.  Make a new expression node tree type,
> instead.  We've made this mistake before (eg for BETWEEN) and I don't
> want to do it again.

Uh, sure. If you don't quote out of context I think it is pretty clear
that I was looking for a helpful suggestion to do just that. Thanks,
I'll proceed with the assurance that you won't object to *that* too ;)
                - Thomas




Re: Suggestions for implementing IS DISTINCT FROM?

From
"Christopher Kings-Lynne"
Date:
> Please, please, do not do that.  Make a new expression node tree type,
> instead.  We've made this mistake before (eg for BETWEEN) and I don't
> want to do it again.

I've actually already done almost all the work for converting BETWEEN to a
node but I have a couple of questions:

Should I use a boolean in the node to indicate whether it is SYMMETRIC or
ASYMMETRIC, or should I use some sort of integer to indicate whether it is
SYMMETRIC, ASYMMETRIC or DEFAULT (ASYMMETRIC).  That way the reverse in
rules and views could leave out the ASYMMETRIC if it wasn't specified
originally, rather than always adding it in.  Which is better?

Chris






Re: Suggestions for implementing IS DISTINCT FROM?

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> Should I use a boolean in the node to indicate whether it is SYMMETRIC or
> ASYMMETRIC, or should I use some sort of integer to indicate whether it is
> SYMMETRIC, ASYMMETRIC or DEFAULT (ASYMMETRIC).  That way the reverse in
> rules and views could leave out the ASYMMETRIC if it wasn't specified
> originally, rather than always adding it in.  Which is better?

My intention is to reverse-list as either "BETWEEN" or "BETWEEN SYMMETRIC".
While I believe in reproducing the source text during reverse listing,
I don't take it to extremes ;-)

So a boolean is sufficient.
        regards, tom lane




Re: Suggestions for implementing IS DISTINCT FROM?

From
Thomas Lockhart
Date:
...
> Adding a new expression node tree type is not too difficult these days;
> see for example Joe Conway's recent NullTest and BooleanTest additions.
> I believe the existing expansions of row comparison operators
> (makeRowExpr) should be replaced by specialized nodes, too.  That would
> give us a shot at implementing row '<', '>' comparisons in a
> spec-compliant fashion...

OK, now that we are pushing new nodes into the executor with abandon :)
...

Let's talk about the preferred technique for doing so, especially with
row-style argument lists. I want to implement IS NULL for rows also
(actually, already did so using a transformation in gram.y -- no need to
jump on that one Tom ;), and found a comment from Joe on the NullTest
code saying that he wanted to do just that "someday". Should we honk
around the existing NullTest node to handle both lists of arguments and
single arguments, or should we have a completely different set of nodes
for handling row arguments? I'll guess the latter, but we should talk
whether that scales properly, about the preferred style for this new
kind of node, and about how to minimize performance hits which we might
see if we have a large number of new node types being handled by the
executor.

If we extend existing nodes to handle lists, then I might be able to
package the DISTINCT test into an A_Expr node, though I haven't pushed
that all the way through to see for sure.

Using the SubLink node does not seem quite right because IS DISTINCT
FROM does not seem to make sense with an embedded select as one of the
arguments, but maybe it does??

Comments? I'm itchy to code but don't want to waste more time than
necessary heading the wrong direction...
                  - Thomas




Re: Suggestions for implementing IS DISTINCT FROM?

From
Tom Lane
Date:
Thomas Lockhart <lockhart@fourpalms.org> writes:
> Let's talk about the preferred technique for doing so, especially with
> row-style argument lists. I want to implement IS NULL for rows also
> (actually, already did so using a transformation in gram.y -- no need to
> jump on that one Tom ;), and found a comment from Joe on the NullTest
> code saying that he wanted to do just that "someday". Should we honk
> around the existing NullTest node to handle both lists of arguments and
> single arguments, or should we have a completely different set of nodes
> for handling row arguments? I'll guess the latter, but we should talk
> whether that scales properly, about the preferred style for this new
> kind of node, and about how to minimize performance hits which we might
> see if we have a large number of new node types being handled by the
> executor.

I have a mild preference for minimizing the number of expression node
types.  If we can represent two similar kinds of expressions with one
node type instead of two, we roughly halve the amount of overhead code
needed.

My inclination would be to divvy things up on the basis of the kinds
of things to be compared, so we'd have these classes of nodes:

1. Test a single <row value expression>; this would cover
        <null predicate> ::=             <row value expression> IS [ NOT ] NULL

2. Compare two <row value expressions>; this would cover
        <comparison predicate> ::=             <row value expression> <comp op> <row value expression>
        <distinct predicate> ::=             <row value expression> IS DISTINCT FROM <row value expression>
  (OVERLAPS is a special case that should remain separate, IMHO)

3. Compare three <row value expressions>:
        <between predicate> ::=             <row value expression> [ NOT ] BETWEEN               [ ASYMMETRIC |
SYMMETRIC]               <row value expression> AND <row value expression>
 

4. Compare a <row value expression> to a list of <row value expressions>:
        <in predicate> ::=             <row value expression>               [ NOT ] IN <left paren> <in value list>
<rightparen>
 
        <in value list> ::=             <row value expression> { <comma> <row value expression> }...

5. Compare a <row value expression> to the outputs of a <subquery>:
        <in predicate> ::=             <row value expression>               [ NOT ] IN <table subquery>
        <quantified comparison predicate> ::=             <row value expression> <comp op> <quantifier>
<tablesubquery>
 
        <match predicate> ::=             <row value expression> MATCH [ UNIQUE ]             [ SIMPLE | PARTIAL | FULL
]                <table subquery>
 

Case 5 corresponds exactly to the existing SubLink node type (although
it's missing the MATCH options at the moment, and doesn't implement all
the <comp op> cases it should).

The spec intends all of these constructs to include the case where the
<row value expression> is a single scalar expression.  I'm feeling
ambivalent about whether we should have the same node type or different
ones for the single-value and row-value cases.  For SubLink we currently
handle single-value and row-value left hand sides with the same code,
and that seems fine.  For <comparison predicate> I definitely *don't*
want to burden scalar comparisons with row-value overhead, so we need
two separate representations for that.  The other cases seem to be on the
borderline.  We don't currently have scalar-case node types for these,
except for NullTest, so new node-type coding is needed anyway --- it
could be either a separate scalar node type or merged with the row-value
case.  No strong preference here, but slight leaning towards merging.

> Using the SubLink node does not seem quite right because IS DISTINCT
> FROM does not seem to make sense with an embedded select as one of the
> arguments, but maybe it does??

I agree it doesn't make sense.  However, we should look at SubLink and
see if we can't clean it up and maybe share some code.  For all of the
cases that compare rows, you need to have a list of the appropriate
scalar comparison operators to use for each column.  The way SubLink
does that is pretty ugly (especially its use of a modifiable Const node).
Let's see if we can't improve that before we copy it ;-).  I'm thinking
that the expression tree itself ought to contain just a list of Oper
nodes dangling from the SubLink or row comparison node.  We'd need an
alternate entry point similar to ExecEvalOper/ExecMakeFunctionResult
that would accept two Datum values rather than a list of subexpressions
to evaluate, but that seems very doable.
        regards, tom lane