Thread: Question about rtrees (overleft replacing left in nodes)
Hello, I'm rather confused about the logic of something in the rtree code, perhaps someone can provide some insight here. Without loss of generality I'll use intervals on R (real number line) below, but this would apply to boxes as well. Note that sup(S) and inf(S) are the upper and lower bound of interval S, which (if S is the closed interval [min,max]) equals min and max, respectively. geo_ops.c defines the overleft operator as (considering intervals, not boxes) S &< T iff sup(S) <= sup(T) whereas the left operator is defined as S << T iff sup(S) < inf(T) fine and dandy. If I understand correctly, in scanning an R-tree (see rtstrat.c) there appear to be several replacements for these operators that occur at nodes (as opposed to leaves) in the tree. For example, if searching for an interval (box) that is the same as T, we check if the node contains T, because the node's interval contains the leaves' intervals. Again, fine and dandy. My concern is this. The left (<<) operator is replaced with overleft (&<) in tests at a node. However, consider the node N whose leaves L and R are as follows: N = [0,3] L = [0,1] R = [2,3] and a target interval T = [1.4,1.6] If I understand correctly, a search for all X << T will test if N &< T. While it is the case that L << T, it is not the case that N &< T, as sup(N) > sup(T). I expect that I'm missing something important in the code. Can someone let me know what that is, so I'll stop worrying? ;) Alternatively, if I'm not missing something, either the node-level replacement must be changed to "N << T or X && T" (which probably wouldn't work unless one added operators to the geo-ops class), or the definition of "overleft" should truly include all cases of overlap (e.g., [0,3] &< [1,2] would be true). Thank you, -- Bill
"bwhite" <bwhite@frognet.net> writes: > I'm rather confused about the logic of something in the rtree code, perhaps > someone can provide some insight here. Hmm ... given the comments in rtstrat.c it seems clear that "overleft" has to mean "overlaps or is to left of", which means that this definition is wrong: > S &< T iff sup(S) <= sup(T) It'd need to be S &< T iff inf(S) <= sup(T) to satisfy the geometrical intuition. (You could quibble about the equality case, but box_overlap seems to consider touching boxes to overlap, so for consistency this should too.) However, if this is indeed wrong, why have we not heard bug reports stating that rtree indexes don't work? Can you generate a test case in which it fails? regards, tom lane
I said: > However, if this is indeed wrong, why have we not heard bug reports > stating that rtree indexes don't work? Can you generate a test case > in which it fails? Actually, it's not necessary to look very far: there's one rtree index defined in the regression database, and guess what: it gets << queries wrong. regression=# explain select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)'; QUERY PLAN --------------------------------------------------------------------- Aggregate (cost=65.53..65.53 rows=1 width=0) -> Seq Scan on fast_emp4000 (cost=0.00..64.75 rows=310 width=0) Filter: (home_base << '(35565,5404),(35546,5360)'::box) (3 rows) regression=# select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)'; count ------- 2214 (1 row) regression=# set enable_seqscan to 0; SET regression=# explain select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)'; QUERY PLAN --------------------------------------------------------------------------------------- Aggregate (cost=112.96..112.96 rows=1 width=0) -> Index Scan using rect2ind on fast_emp4000 (cost=0.00..112.18 rows=310 width=0) Index Cond: (home_base << '(35565,5404),(35546,5360)'::box) (3 rows) regression=# select count(*) from fast_emp4000 where home_base << '(35565,5404),(35546,5360)'; count ------- 1363 (1 row) So we've got a problem here :-( regards, tom lane
William White <bwhite@frognet.net> writes: > Tom Lane wrote: >> However, if this is indeed wrong, why have we not heard bug reports >> stating that rtree indexes don't work? > Because the positionsel/positionjoinsel estimator functions make it all > but impossible to invoke index search on r-trees for position operators; Good point. You can force it by setting enable_seqscan to false, but otherwise it's unlikely to happen. > Suggested remediation, if this is reproducible: > 1. Modify overleft and overright to actually mean "overlap or left" and > "overlap or right" respectively, which might break existing code, or > 2. Add two new functions, e.g., "overlap_or_left" and > "overlap_or_right", and two operators (e.g., "&&||<<" and "&&||>>" ... > yes, I'm kidding about the ugly operator names), and modify the operator > class definition to use the new operators instead of the old. Those seem like our choices, all right. It seems to me that the operator rtree actually wants is best thought of as "is not to right of" (resp. "is not to left of"). The reference to overlapping is misleading, at least when thinking in 2 or more dimensions, since a box that is completely above or below the reference box could still "overlap" in the 1-dimensional sense being used here. So reasonable operator names would be !>> and !<< respectively. So, as far as fixing rtree goes, I'm in favor of #2. Independently of rtree considerations, though, ISTM the existing operators are broken in the sense that they don't meet the advertised definition, which is "Overlaps or is left of?". In the first place it's not obvious that this is meant to be thought of in 1-D terms. A person thinking in 2-D terms might think "a &< b" is supposed to mean "a && b OR a << b", which is a tighter constraint than the actual 1-D behavior, since && considers Y overlap as well as X. And even when thinking in 1-D, "overlaps" has a different meaning than what the code is doing, anyway. So it's clear that we should also change either the code or the documentation for the existing operators. If we are going to add new operators then I'd favor leaving the existing code alone as a potentially useful behavior, but fixing the documentation. I can't think of an equally succinct definition of what the operators really do though. Comments? BTW, it occurs to me we should add rtree opclasses associated with Y-dimension tests (is below, etc) ... or should there be more members of an rtree opclass than there are now? regards, tom lane
William White <bwhite@frognet.net> writes: > Out of curiosity how many tuples are in that test? I wasn't able to > invoke index scan with position operators even at 100K tuples. Number of tuples doesn't matter --- I just forced the plan choice with enable_seqscan = off. (Although I imagine that no failure would be observed in very small tests --- you'd need enough entries to force the rtree index to cover multiple pages.) > Side question: is there a user contrib area for extensions? The path to > this discovery started with a general interval "template class" to > support any interval type (open, half-open, or closed) on any scalar > data type (note: by "template class" read "C equivalent thereof using > Gnu cpp ## macro construction kluges to create another C file with data > types fileld in"). My original goal was [timestamp,timestamp) intervals > (or (t,t) or [t,t] or whatever) but it works with any scalar that's > internally numeric. Someone else might as well benefit from the > frustrations I've had in figuring out how to handle operations on half- > and fully-open empty intervals. :) I could see accepting this as a contrib module, if you want to submit it. Plan B would be to set up a project for it on gborg.postgresql.org, but it seems tightly enough tied to the backend that maintenance would be easier as a contrib item. regards, tom lane
Tom Lane wrote: > I said: > >>However, if this is indeed wrong, why have we not heard bug reports >>stating that rtree indexes don't work? Can you generate a test case >>in which it fails? > > > Actually, it's not necessary to look very far: there's one rtree index > defined in the regression database, and guess what: it gets << queries > wrong. OK, glad to know I'm not going crazy. Oh well, if there's a way to break something, trust me to find it within a week of using it. :( Out of curiosity how many tuples are in that test? I wasn't able to invoke index scan with position operators even at 100K tuples. Also worth noting that &< and &> queries fail (that's the test case I just sent) as well; the << and >> failure is (if I understand correctly) a result of &< and &> being used at the node level in the rtree. Side question: is there a user contrib area for extensions? The path to this discovery started with a general interval "template class" to support any interval type (open, half-open, or closed) on any scalar data type (note: by "template class" read "C equivalent thereof using Gnu cpp ## macro construction kluges to create another C file with data types fileld in"). My original goal was [timestamp,timestamp) intervals (or (t,t) or [t,t] or whatever) but it works with any scalar that's internally numeric. Someone else might as well benefit from the frustrations I've had in figuring out how to handle operations on half- and fully-open empty intervals. :) Thanks, -- Bill
Tom Lane wrote: > It'd need to be > > S &< T iff inf(S) <= sup(T) > > to satisfy the geometrical intuition. (You could quibble about the > equality case, but box_overlap seems to consider touching boxes to > overlap, so for consistency this should too.) > > However, if this is indeed wrong, why have we not heard bug reports > stating that rtree indexes don't work? Because the positionsel/positionjoinsel estimator functions make it all but impossible to invoke index search on r-trees for position operators; see below. > Can you generate a test case > in which it fails? Yes, actually; I was able to demonstrate there is a problem after I'd already sent off the email. But you have to "cheat", because the default cost estimator functions left, overleft, right, and overright all return a large fixed value (0.1, IIRC two orders of magnitude over areasel/areajoinsel) which all but guarantees sequential scan for any relvar with a reasonable number of tuples (I tried 100,000 tuples and still got sequential scan). These are, of course, constant cost estimator functions. The following Perl script temporarily replaces oprrest with "areasel" and oprjoin with "areajoinsel" (instead of positionsel and positionjoinsel rsp.). WARNING: This assumes oids 493 and 494 correspond to box_ops << and &< respectively, that happens to be true in the pgsql version I'm using at the moment. Could be replaced with test for operator name and data type, of course. Suggested remediation, if this is reproducible: 1. Modify overleft and overright to actually mean "overlap or left" and "overlap or right" respectively, which might break existing code, or 2. Add two new functions, e.g., "overlap_or_left" and "overlap_or_right", and two operators (e.g., "&&||<<" and "&&||>>" ... yes, I'm kidding about the ugly operator names), and modify the operator class definition to use the new operators instead of the old. Existing code would continue to work (since &< and &> would now never invoke index scan), new code could take advantage of index scan on position operators (assuming the cost estimator were modified to permit this). Code follows. Apologies for Perl rather than straight SQL, this was adapted from part of a test suite for an interval (as in "interval in R" not the IMNSHO misnamed SQL "interval" type) extension I'm writing, which is why I ran across the problem. ---- cut here ---- #!/usr/bin/perl open(P,"|psql >/dev/null 2>&1"); print P <<"EOF"; DROP INDEX i_test2; DROP TABLE test2; CREATE TABLE test2 ( i box NOT NULL ); CREATE INDEX i_test2 ON test2 using rtree (i); EOF $t0 = time; for ($i=1;$i<=1000;$i++) { printf(P "INSERT INTO test2 (i) VALUES ('(%d,%d),(%d,%d)');\n", $i+1,1,$i,0); } print P "VACUUM ANALYZE;\n"; close(P); open(P,"|psql"); print P <<"EOF"; -- Temporarily use cost estimators for areasel/areajoinsel on << and &< update pg_operator set oprrest = 'areasel', oprjoin = 'areajoinsel' where oid = 494 or oid = 493; -- This works (at least for 10 results) since it'll be towards the right -- half of leaves SELECT * FROM test2 WHERE i &< ('(999.5,1),(998.5,0)') LIMIT 1; -- This fails since it's to the left half of leaves SELECT * FROM test2 WHERE i &< ('(11.5,1),(10.5,0)') LIMIT 1; -- Restore original cost estimators update pg_operator set oprrest = 'positionsel', oprjoin = 'positionjoinsel' where oid = 494 or oid = 493; -- Now they both work: SELECT * FROM test2 WHERE i &< ('(999.5,1),(998.5,0)') LIMIT 1; SELECT * FROM test2 WHERE i &< ('(11.5,1),(10.5,0)') LIMIT 1; EOF ---- cut here ----
Tom Lane wrote: > Good point. You can force it by setting enable_seqscan to false, but > otherwise it's unlikely to happen. Ah, didn't know about enable_seqscan, thanks. > It seems to me that the operator rtree actually wants is best thought of > as "is not to right of" (resp. "is not to left of"). The reference to > overlapping is misleading, at least when thinking in 2 or more > dimensions, since a box that is completely above or below the reference > box could still "overlap" in the 1-dimensional sense being used here. > So reasonable operator names would be !>> and !<< respectively. Well, the rtree implementation is 1-D, so I've been thinking in 1-D. But I agree those are reasonable ... come to think of it, you could fill in the commutator and negator attributes, right? (or am I missing the boundary case here; I haven't really looked yet) > And even when thinking in 1-D, > "overlaps" has a different meaning than what the code is doing, anyway. That was what originally caught my eye. > So it's clear that we should also change either the code or the > documentation for the existing operators. If we are going to add new > operators then I'd favor leaving the existing code alone as a > potentially useful behavior, but fixing the documentation. and of course the old operators wouldn't be used in the operator class search strategies, if I'm understanding correctly. > I can't think of an equally succinct definition of what the operators > really do though. Comments? I think "not to the left of" and "not to the right of" are sufficiently succinct. That they may not see much user application may not be relevant, if they are required to properly implement << and >> on rtrees. > BTW, it occurs to me we should add rtree opclasses associated with > Y-dimension tests (is below, etc) ... or should there be more members of > an rtree opclass than there are now? I'm not familiar enough with rtrees to know for sure, but it appears that the indexing itself is "dimension agnostic", only caring about union (well, bounding region), intersection, and size. Thus, you could add as many dimensions as you wanted, you'd just need 4d+4 search strategy operators (&&, ~=, ~, and @ being non-positional) in the opclass. It would be nice if pgsql supported a variable number of strategy operators, but this would probably mean changing the order (putting the 4 nonpositionals first) and rewriting a lot of code.
William White <bwhite@frognet.net> writes: > Perhaps document as S &< T iff S "does not extend to the right > of/beyond" (the right boundary of) T? "Does not extend to the right of" works for me, unless someone on the list has got a better idea ... regards, tom lane
Tom Lane wrote: > Right, but what about the existing operators --- what is a more correct > way to document them? Ouch. Appealing to J.F. Allen's terminology ("An Interval-Based Representation of Temporal Knowledge", Comm ACM 26(11) 832-43), overleft could be called "left or finishes" (implying all other related conditions, of which there are a bunch) but this relies too heavily on time notation I think. "right boundary left of right boundary" is accurate but rather verbose, "rightoverleft" too confusing. Of course, I suspect most people will see the operator, not the C function name; if they see the latter they can read the code anyway. Perhaps document as S &< T iff S "does not extend to the right of/beyond" (the right boundary of) T? And either leave the C functions as-is or give them reasonable names; in either case, "a.high.x <= b.high.x" is just as clear as any comment. -- Bill
William White <bwhite@frognet.net> writes: > Tom Lane wrote: >> I can't think of an equally succinct definition of what the operators >> really do though. Comments? > I think "not to the left of" and "not to the right of" are sufficiently > succinct. That they may not see much user application may not be > relevant, if they are required to properly implement << and >> on rtrees. Right, but what about the existing operators --- what is a more correct way to document them? regards, tom lane