Thread: Question about rtrees (overleft replacing left in nodes)

Question about rtrees (overleft replacing left in nodes)

From

"bwhite"

Date:

31 March 2004, 04:10:43

Hello,

I'm rather confused about the logic of something in the rtree code, perhaps
someone can provide some insight here.  Without loss of generality I'll
use intervals on R (real number line) below, but this would apply to
boxes as well.  Note that sup(S) and inf(S) are the upper and lower bound
of interval S, which (if S is the closed interval [min,max]) equals
min and max, respectively.

geo_ops.c defines the overleft operator as (considering intervals, not
boxes)

S &< T iff sup(S) <= sup(T)

whereas the left operator is defined as

S << T iff sup(S) < inf(T)

fine and dandy.

If I understand correctly, in scanning an R-tree (see rtstrat.c) there
appear to be several replacements for these operators that occur at nodes
(as opposed to leaves) in the tree.  For example, if searching for an
interval (box) that is the same as T, we check if the node contains T,
because the node's interval contains the leaves' intervals.

Again, fine and dandy.

My concern is this.  The left (<<) operator is replaced with overleft (&<)
in tests at a node.  However, consider the node N whose leaves L and R
are as follows:

N = [0,3]   L = [0,1]   R = [2,3]

and a target interval

T = [1.4,1.6]

If I understand correctly, a search for all X << T will test if N &< T.
While it is the case that L << T, it is not the case that N &< T, as
sup(N) > sup(T).

I expect that I'm missing something important in the code.  Can someone
let me know what that is, so I'll stop worrying? ;)

Alternatively, if I'm not missing something, either the node-level
replacement must be changed to "N << T or X && T" (which probably
wouldn't work unless one added operators to the geo-ops class), or
the definition of "overleft" should truly include all cases of overlap
(e.g., [0,3] &< [1,2] would be true).

Thank you,

 -- Bill

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 10:57:52

"bwhite" <bwhite@frognet.net> writes:
> I'm rather confused about the logic of something in the rtree code, perhaps
> someone can provide some insight here.

Hmm ... given the comments in rtstrat.c it seems clear that "overleft"
has to mean "overlaps or is to left of", which means that this
definition is wrong:

> S &< T iff sup(S) <= sup(T)

It'd need to be

S &< T iff inf(S) <= sup(T)

to satisfy the geometrical intuition.  (You could quibble about the
equality case, but box_overlap seems to consider touching boxes to
overlap, so for consistency this should too.)

However, if this is indeed wrong, why have we not heard bug reports
stating that rtree indexes don't work?  Can you generate a test case
in which it fails?

            regards, tom lane

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 11:06:37

I said:
> However, if this is indeed wrong, why have we not heard bug reports
> stating that rtree indexes don't work?  Can you generate a test case
> in which it fails?

Actually, it's not necessary to look very far: there's one rtree index
defined in the regression database, and guess what: it gets << queries
wrong.

regression=# explain select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)';
                             QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=65.53..65.53 rows=1 width=0)
   ->  Seq Scan on fast_emp4000  (cost=0.00..64.75 rows=310 width=0)
         Filter: (home_base << '(35565,5404),(35546,5360)'::box)
(3 rows)

regression=# select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)';
 count
-------
  2214
(1 row)

regression=# set enable_seqscan to 0;
SET
regression=# explain select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)';
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=112.96..112.96 rows=1 width=0)
   ->  Index Scan using rect2ind on fast_emp4000  (cost=0.00..112.18 rows=310 width=0)
         Index Cond: (home_base << '(35565,5404),(35546,5360)'::box)
(3 rows)

regression=# select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)';
 count
-------
  1363
(1 row)

So we've got a problem here :-(

            regards, tom lane

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 14:02:48

William White <bwhite@frognet.net> writes:
> Tom Lane wrote:
>> However, if this is indeed wrong, why have we not heard bug reports
>> stating that rtree indexes don't work?

> Because the positionsel/positionjoinsel estimator functions make it all
> but impossible to invoke index search on r-trees for position operators;

Good point.  You can force it by setting enable_seqscan to false, but
otherwise it's unlikely to happen.

> Suggested remediation, if this is reproducible:

> 1. Modify overleft and overright to actually mean "overlap or left" and
> "overlap or right" respectively, which might break existing code, or

> 2. Add two new functions, e.g., "overlap_or_left" and
> "overlap_or_right", and two operators (e.g., "&&||<<" and "&&||>>" ...
> yes, I'm kidding about the ugly operator names), and modify the operator
> class definition to use the new operators instead of the old.

Those seem like our choices, all right.

It seems to me that the operator rtree actually wants is best thought of
as "is not to right of" (resp. "is not to left of").  The reference to
overlapping is misleading, at least when thinking in 2 or more
dimensions, since a box that is completely above or below the reference
box could still "overlap" in the 1-dimensional sense being used here.
So reasonable operator names would be !>> and !<< respectively.

So, as far as fixing rtree goes, I'm in favor of #2.

Independently of rtree considerations, though, ISTM the existing
operators are broken in the sense that they don't meet the advertised
definition, which is "Overlaps or is left of?".  In the first place it's
not obvious that this is meant to be thought of in 1-D terms.  A person
thinking in 2-D terms might think "a &< b" is supposed to mean "a && b
OR a << b", which is a tighter constraint than the actual 1-D behavior,
since && considers Y overlap as well as X.  And even when thinking in 1-D,
"overlaps" has a different meaning than what the code is doing, anyway.

So it's clear that we should also change either the code or the
documentation for the existing operators.  If we are going to add new
operators then I'd favor leaving the existing code alone as a
potentially useful behavior, but fixing the documentation.  I can't
think of an equally succinct definition of what the operators really do
though.  Comments?

BTW, it occurs to me we should add rtree opclasses associated with
Y-dimension tests (is below, etc) ... or should there be more members of
an rtree opclass than there are now?

            regards, tom lane

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 15:46:38

William White <bwhite@frognet.net> writes:
> Out of curiosity how many tuples are in that test?  I wasn't able to
> invoke index scan with position operators even at 100K tuples.

Number of tuples doesn't matter --- I just forced the plan choice with
enable_seqscan = off.  (Although I imagine that no failure would be
observed in very small tests --- you'd need enough entries to force the
rtree index to cover multiple pages.)

> Side question: is there a user contrib area for extensions?  The path to
> this discovery started with a general interval "template class" to
> support any interval type (open, half-open, or closed) on any scalar
> data type (note: by "template class" read "C equivalent thereof using
> Gnu cpp ## macro construction kluges to create another C file with data
> types fileld in").  My original goal was [timestamp,timestamp) intervals
> (or (t,t) or [t,t] or whatever) but it works with any scalar that's
> internally numeric.  Someone else might as well benefit from the
> frustrations I've had in figuring out how to handle operations on half-
> and fully-open empty intervals. :)

I could see accepting this as a contrib module, if you want to submit
it.  Plan B would be to set up a project for it on gborg.postgresql.org,
but it seems tightly enough tied to the backend that maintenance would
be easier as a contrib item.

            regards, tom lane

Re: Question about rtrees (overleft replacing left in nodes)

From

William White

Date:

31 March 2004, 16:02:11

Tom Lane wrote:
> I said:
>
>>However, if this is indeed wrong, why have we not heard bug reports
>>stating that rtree indexes don't work?  Can you generate a test case
>>in which it fails?
>
>
> Actually, it's not necessary to look very far: there's one rtree index
> defined in the regression database, and guess what: it gets << queries
> wrong.

OK, glad to know I'm not going crazy.  Oh well, if there's a way to
break something, trust me to find it within a week of using it. :(

Out of curiosity how many tuples are in that test?  I wasn't able to
invoke index scan with position operators even at 100K tuples.

Also worth noting that &< and &> queries fail (that's the test case I
just sent) as well; the << and >> failure is (if I understand correctly)
a result of &< and &> being used at the node level in the rtree.

Side question: is there a user contrib area for extensions?  The path to
this discovery started with a general interval "template class" to
support any interval type (open, half-open, or closed) on any scalar
data type (note: by "template class" read "C equivalent thereof using
Gnu cpp ## macro construction kluges to create another C file with data
types fileld in").  My original goal was [timestamp,timestamp) intervals
(or (t,t) or [t,t] or whatever) but it works with any scalar that's
internally numeric.  Someone else might as well benefit from the
frustrations I've had in figuring out how to handle operations on half-
and fully-open empty intervals. :)

Thanks,

-- Bill

Re: Question about rtrees (overleft replacing left in nodes)

From

William White

Date:

31 March 2004, 16:12:47

Tom Lane wrote:

> It'd need to be
>
> S &< T iff inf(S) <= sup(T)
>
> to satisfy the geometrical intuition.  (You could quibble about the
> equality case, but box_overlap seems to consider touching boxes to
> overlap, so for consistency this should too.)
>
> However, if this is indeed wrong, why have we not heard bug reports
> stating that rtree indexes don't work?

Because the positionsel/positionjoinsel estimator functions make it all
but impossible to invoke index search on r-trees for position operators;
see below.

 > Can you generate a test case
> in which it fails?

Yes, actually; I was able to demonstrate there is a problem after I'd
already sent off the email.  But you have to "cheat", because the
default cost estimator functions left, overleft, right, and overright
all return a large fixed value (0.1, IIRC two orders of magnitude over
areasel/areajoinsel) which all but guarantees sequential scan for any
relvar with a reasonable number of tuples (I tried 100,000 tuples and
still got sequential scan).

These are, of course, constant cost estimator functions.

The following Perl script temporarily replaces oprrest with "areasel"
and oprjoin with "areajoinsel" (instead of positionsel and
positionjoinsel rsp.).  WARNING: This assumes oids 493 and 494
correspond to box_ops << and &< respectively, that happens to be true in
the pgsql version I'm using at the moment.  Could be replaced with test
for operator name and data type, of course.

Suggested remediation, if this is reproducible:

1. Modify overleft and overright to actually mean "overlap or left" and
"overlap or right" respectively, which might break existing code, or

2. Add two new functions, e.g., "overlap_or_left" and
"overlap_or_right", and two operators (e.g., "&&||<<" and "&&||>>" ...
yes, I'm kidding about the ugly operator names), and modify the operator
class definition to use the new operators instead of the old.  Existing
code would continue to work (since &< and &> would now never invoke
index scan), new code could take advantage of index scan on position
operators (assuming the cost estimator were modified to permit this).

Code follows.  Apologies for Perl rather than straight SQL, this was
adapted from part of a test suite for an interval (as in "interval in R"
not the IMNSHO misnamed SQL "interval" type) extension I'm writing,
which is why I ran across the problem.

---- cut here ----

#!/usr/bin/perl
open(P,"|psql >/dev/null 2>&1");
print P <<"EOF";
DROP INDEX i_test2;
DROP TABLE test2;
CREATE TABLE test2
(
     i box NOT NULL
);

CREATE INDEX i_test2 ON test2 using rtree (i);
EOF

$t0 = time;
for ($i=1;$i<=1000;$i++)
{
     printf(P "INSERT INTO test2 (i) VALUES ('(%d,%d),(%d,%d)');\n",
             $i+1,1,$i,0);
}
print P "VACUUM ANALYZE;\n";
close(P);

open(P,"|psql");
print P <<"EOF";

-- Temporarily use cost estimators for areasel/areajoinsel on << and &<
update pg_operator set oprrest = 'areasel', oprjoin = 'areajoinsel'
     where oid = 494 or oid = 493;

-- This works (at least for 10 results) since it'll be towards the right
-- half of leaves

SELECT * FROM test2 WHERE i &< ('(999.5,1),(998.5,0)') LIMIT 1;

-- This fails since it's to the left half of leaves

SELECT * FROM test2 WHERE i &< ('(11.5,1),(10.5,0)') LIMIT 1;

-- Restore original cost estimators
update pg_operator set oprrest = 'positionsel', oprjoin = 'positionjoinsel'
     where oid = 494 or oid = 493;

-- Now they both work:
SELECT * FROM test2 WHERE i &< ('(999.5,1),(998.5,0)') LIMIT 1;
SELECT * FROM test2 WHERE i &< ('(11.5,1),(10.5,0)') LIMIT 1;

EOF

---- cut here ----

Re: Question about rtrees (overleft replacing left in nodes)

From

William White

Date:

31 March 2004, 16:58:24

Tom Lane wrote:

> Good point.  You can force it by setting enable_seqscan to false, but
> otherwise it's unlikely to happen.

Ah, didn't know about enable_seqscan, thanks.

> It seems to me that the operator rtree actually wants is best thought of
> as "is not to right of" (resp. "is not to left of").  The reference to
> overlapping is misleading, at least when thinking in 2 or more
> dimensions, since a box that is completely above or below the reference
> box could still "overlap" in the 1-dimensional sense being used here.
> So reasonable operator names would be !>> and !<< respectively.

Well, the rtree implementation is 1-D, so I've been thinking in 1-D.
But I agree those are reasonable ... come to think of it, you could fill
in the commutator and negator attributes, right? (or am I missing the
boundary case here; I haven't really looked yet)

> And even when thinking in 1-D,
> "overlaps" has a different meaning than what the code is doing, anyway.

That was what originally caught my eye.

> So it's clear that we should also change either the code or the
> documentation for the existing operators.  If we are going to add new
> operators then I'd favor leaving the existing code alone as a
> potentially useful behavior, but fixing the documentation.

and of course the old operators wouldn't be used in the operator class
search strategies, if I'm understanding correctly.

> I can't think of an equally succinct definition of what the operators
> really do though.  Comments?

I think "not to the left of" and "not to the right of" are sufficiently
succinct.  That they may not see much user application may not be
relevant, if they are required to properly implement << and >> on rtrees.

> BTW, it occurs to me we should add rtree opclasses associated with
> Y-dimension tests (is below, etc) ... or should there be more members of
> an rtree opclass than there are now?

I'm not familiar enough with rtrees to know for sure, but it appears
that the indexing itself is "dimension agnostic", only caring about
union (well, bounding region), intersection, and size.  Thus, you could
add as many dimensions as you wanted, you'd just need 4d+4 search
strategy operators (&&, ~=, ~, and @ being non-positional) in the
opclass.  It would be nice if pgsql supported a variable number of
strategy operators, but this would probably mean changing the order
(putting the 4 nonpositionals first) and rewriting a lot of code.

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 17:49:54

William White <bwhite@frognet.net> writes:
> Perhaps document as S &< T iff S "does not extend to the right
> of/beyond" (the right boundary of) T?

"Does not extend to the right of" works for me, unless someone on the
list has got a better idea ...

            regards, tom lane

Re: Question about rtrees (overleft replacing left in nodes)

From

William White

Date:

31 March 2004, 18:35:08

Tom Lane wrote:

> Right, but what about the existing operators --- what is a more correct
> way to document them?

Ouch.

Appealing to J.F. Allen's terminology ("An Interval-Based Representation
of Temporal Knowledge", Comm ACM 26(11) 832-43), overleft could be
called "left or finishes" (implying all other related conditions, of
which there are a bunch) but this relies too heavily on time notation I
think.  "right boundary left of right boundary" is accurate but rather
verbose, "rightoverleft" too confusing.  Of course, I suspect most
people will see the operator, not the C function name; if they see the
latter they can read the code anyway.

Perhaps document as S &< T iff S "does not extend to the right
of/beyond" (the right boundary of) T?  And either leave the C functions
as-is or give them reasonable names; in either case, "a.high.x <=
b.high.x" is just as clear as any comment.

-- Bill

Re: Question about rtrees (overleft replacing left in nodes)

From

Tom Lane

Date:

31 March 2004, 18:43:50

William White <bwhite@frognet.net> writes:
> Tom Lane wrote:
>> I can't think of an equally succinct definition of what the operators
>> really do though.  Comments?

> I think "not to the left of" and "not to the right of" are sufficiently
> succinct.  That they may not see much user application may not be
> relevant, if they are required to properly implement << and >> on rtrees.

Right, but what about the existing operators --- what is a more correct
way to document them?

            regards, tom lane