Thread: [PATCH] we have added support for box type in SP-GiST index
Hello, Hacker. * [PATCH] add a box index to sp-gist We have extended sp-gist with an index that keeps track of boxes We have used ideas underlying sp-gist range implementation to represent 2D boxes as points in 4D space. We use quad tree analogue, but in 4-dimensional space. We call this tree q4d. Each node of this tree is a box (a point in 4D space) which splits space in 16 hyperrectangles. Rationale: r-tree assumes that boxes we're trying to index don't overlap much. When this assumption fails, r-tree performs badly, while our proposal to represent a rectangle as a point in 4D space solves this problem. NB: the index uses traversalValue introduced in a separate patch. * [PATCH] add traversalValue in sp-gist During implementation of box index for sp-gist we saw that we only keep rectangles, but to determine traversal direction we may need to know boundaries of a hyperrectangle. So we calculate them on the fly and store them in traversalValue, available when traversing child nodes, because otherwise we would't be able to calculate them from inside the inner_consistent function call. This patch was written by Teodor Sigaev. * [PATCH] change reconstructValue -> traversalValue in range_spgist.c During traversal, local context is insufficient to pick traversal direction. As of now, this is worked around with the help of reconstructValue. reconstructValue only stores data of the same type as a tree node, that is, a range. We have written a patch that calculates auxillary values and makes them accessible during traversal. We then use traversalValue in a new box index and change range_spgist.c to use it in place of reconstructValue. NB: apply this patch after traversalValue patch.
Attachment
On Sat, Oct 31, 2015 at 9:49 PM, Alexander Lebedev <a.lebedev@postgrespro.ru> wrote:
Hello, Hacker.
* [PATCH] add a box index to sp-gist
We have extended sp-gist with an index that keeps track of boxes
We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.
Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.
NB: the index uses traversalValue introduced in a separate patch.
* [PATCH] add traversalValue in sp-gist
During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.
This patch was written by Teodor Sigaev.
* [PATCH] change reconstructValue -> traversalValue in range_spgist.c
During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.
We have written a patch that calculates auxillary values and makes
them accessible during traversal.
We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.
NB: apply this patch after traversalValue patch.
Did you forget to show us performance numbers ?
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
CC'ing Teodor because he's author of one of the patches. Alexander Lebedev wrote: > Hello, Hacker. So, can we have a more thorough explanation of what is the point of this patch? If it is supposed to improve the performance of searching for boxes, can we see measurements from some benchmark? Do the patches still apply, or do you need to rebase? If so, please post an updated version. I'm setting this patch as Waiting on Author. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alexander Lebedev wrote: > Hello, Hacker. > > * [PATCH] add a box index to sp-gist I closed this patch as "returned with feedback" because the author didn't reply in three months. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
It's very pity but author is not able to continue work on this patch, and I would like to raise this flag. I'd like to add some comments about patches: traversalValue patch adds arbitrary value assoсiated with branch in SP-GiST tree walk. Unlike to recostructedValue it could be just pointer, it havn't to be a regular pgsql type. Also, it could be used independly from reconstructedValue. This patch is used both following attached patches. range patch just switchs using reconstructedValue to traversalValue in range opclasses. reconstructedValue was used just because it was an acceptable workaround in case of range type. Range opclase stores a full value in leafs and doesn't need to use reconstructedValue to return tuple in index only scan. See http://www.postgresql.org/message-id/5399.1343512250@sss.pgh.pa.us q4d patch implements index over boxes using SP-GiST. Basic idea was an observation, range opclass thinks about one-dimensional ranges as 2D points. Following this idea, we can think that 2D box (what is 2 points or 4 numbers) could represent a 4D point. We hoped that this approach will be much more effective than traditional R-Tree in case of many overlaps in box's collection. Performance results are coming shortly. In near future (say, tommorrow) I hope to suggest a opclass over polygons based on this approach. Alexander Lebedev wrote: > Hello, Hacker. > > * [PATCH] add a box index to sp-gist > > We have extended sp-gist with an index that keeps track of boxes > > We have used ideas underlying sp-gist range implementation to > represent 2D boxes as points in 4D space. We use quad tree > analogue, but in 4-dimensional space. We call this tree q4d. Each > node of this tree is a box (a point in 4D space) which splits space > in 16 hyperrectangles. > > Rationale: r-tree assumes that boxes we're trying to index don't > overlap much. When this assumption fails, r-tree performs badly, > while our proposal to represent a rectangle as a point in 4D space > solves this problem. > > NB: the index uses traversalValue introduced in a separate patch. > > * [PATCH] add traversalValue in sp-gist > > During implementation of box index for sp-gist we saw that we only > keep rectangles, but to determine traversal direction we may need > to know boundaries of a hyperrectangle. So we calculate them > on the fly and store them in traversalValue, available > when traversing child nodes, because otherwise we would't be able to > calculate them from inside the inner_consistent function call. > > This patch was written by Teodor Sigaev. > > * [PATCH] change reconstructValue -> traversalValue in range_spgist.c > > During traversal, local context is > insufficient to pick traversal direction. As of now, this is worked > around with the help of reconstructValue. reconstructValue only > stores data of the same type as a tree node, that is, a range. > > We have written a patch that calculates auxillary values and makes > them accessible during traversal. > > We then use traversalValue in a new box index and change > range_spgist.c to use it in place of reconstructValue. > > NB: apply this patch after traversalValue patch. > > > > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
On Mon, Feb 15, 2016 at 6:29 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
It's very pity but author is not able to continue work on this patch, and I would like to raise this flag.
I'd like to add some comments about patches:
traversalValue patch adds arbitrary value assoсiated with branch in SP-GiST tree walk. Unlike to recostructedValue it could be just pointer, it havn't to be a regular pgsql type. Also, it could be used independly from reconstructedValue. This patch is used both following attached patches.
range patch just switchs using reconstructedValue to traversalValue in range opclasses. reconstructedValue was used just because it was an acceptable workaround in case of range type. Range opclase stores a full value in leafs and doesn't need to use reconstructedValue to return tuple in index only scan.
See http://www.postgresql.org/message-id/5399.1343512250@sss.pgh.pa.us
q4d patch implements index over boxes using SP-GiST. Basic idea was an observation, range opclass thinks about one-dimensional ranges as 2D points.
Following this idea, we can think that 2D box (what is 2 points or 4 numbers) could represent a 4D point. We hoped that this approach will be much more effective than traditional R-Tree in case of many overlaps in box's collection.
Performance results are coming shortly.
I made some benchmarks using two real-life datasets. Streets data set is consists of 3691620 MBRs of russian streets with not much overlaps. As expected, spgist is slower than rtree (gist), because of much bigger number of blocks readed. Build time, however, is faster for spgist than rtree (~ 1.3 times).
Bounds data set is consists of 7803499 MBRs of some objects and are highly overlapped, see attached picture (crop with center in Moscow) of boxes.
Create index time (ms):
spgist: 47036.051
rtree: 68491.242
Size:
spgist: 505 MB
rtree: 663 MB
heap: 871 MB
Let's consider a small box in downtown of Moscow: box '(37.607164,55.756489), (37.607797,55.756725)'.
&& оператор (returns 23958 boxes) :
&& оператор (returns 23958 boxes) :
spgist: 14.649
rtree: 23.715
seqscan: 924.344
@> operator (returns 23868 boxes):
spgist: 14.113
rtree: 26.853
seqscan: 909.147
<@ operator (returns 0 boxes):
spgist: 0..268 ms
rtree: 12.563
My conclusion is that for "spagetti" data new opclass for spgist is good to have - it's smaller and faster (both, created index and search) that rtree (gist based).
In near future (say, tommorrow) I hope to suggest a opclass over polygons based on this approach.
Yes, it'd be interested.
--
Alexander Lebedev wrote:Hello, Hacker.
* [PATCH] add a box index to sp-gist
We have extended sp-gist with an index that keeps track of boxes
We have used ideas underlying sp-gist range implementation to
represent 2D boxes as points in 4D space. We use quad tree
analogue, but in 4-dimensional space. We call this tree q4d. Each
node of this tree is a box (a point in 4D space) which splits space
in 16 hyperrectangles.
Rationale: r-tree assumes that boxes we're trying to index don't
overlap much. When this assumption fails, r-tree performs badly,
while our proposal to represent a rectangle as a point in 4D space
solves this problem.
NB: the index uses traversalValue introduced in a separate patch.
* [PATCH] add traversalValue in sp-gist
During implementation of box index for sp-gist we saw that we only
keep rectangles, but to determine traversal direction we may need
to know boundaries of a hyperrectangle. So we calculate them
on the fly and store them in traversalValue, available
when traversing child nodes, because otherwise we would't be able to
calculate them from inside the inner_consistent function call.
This patch was written by Teodor Sigaev.
* [PATCH] change reconstructValue -> traversalValue in range_spgist.c
During traversal, local context is
insufficient to pick traversal direction. As of now, this is worked
around with the help of reconstructValue. reconstructValue only
stores data of the same type as a tree node, that is, a range.
We have written a patch that calculates auxillary values and makes
them accessible during traversal.
We then use traversalValue in a new box index and change
range_spgist.c to use it in place of reconstructValue.
NB: apply this patch after traversalValue patch.
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2/15/16 10:29 AM, Teodor Sigaev wrote: > It's very pity but author is not able to continue work on this patch, > and I would like to raise this flag. > > I'd like to add some comments about patches: > > traversalValue patch adds arbitrary value assoсiated with branch in > SP-GiST tree walk. Unlike to recostructedValue it could be just pointer, > it havn't to be a regular pgsql type. Also, it could be used independly > from reconstructedValue. This patch is used both following attached > patches. > > range patch just switchs using reconstructedValue to traversalValue in > range opclasses. reconstructedValue was used just because it was an > acceptable workaround in case of range type. Range opclase stores a full > value in leafs and doesn't need to use reconstructedValue to return > tuple in index only scan. > See http://www.postgresql.org/message-id/5399.1343512250@sss.pgh.pa.us > > q4d patch implements index over boxes using SP-GiST. Basic idea was an > observation, range opclass thinks about one-dimensional ranges as 2D > points. > Following this idea, we can think that 2D box (what is 2 points or 4 > numbers) could represent a 4D point. We hoped that this approach will be > much more effective than traditional R-Tree in case of many overlaps in > box's collection. > Performance results are coming shortly. It appears that the issues raised in this thread have been addressed but the patch still has not gone though a real review. Anybody out there willing to take a crack at a review? All three patches apply (with whitespace issues). Thanks, -- -David david@pgmasters.net
On Mon, Mar 14, 2016 at 9:01 PM, David Steele <david@pgmasters.net> wrote:
On 2/15/16 10:29 AM, Teodor Sigaev wrote:It's very pity but author is not able to continue work on this patch,
and I would like to raise this flag.
I'd like to add some comments about patches:
traversalValue patch adds arbitrary value assoсiated with branch in
SP-GiST tree walk. Unlike to recostructedValue it could be just pointer,
it havn't to be a regular pgsql type. Also, it could be used independly
from reconstructedValue. This patch is used both following attached
patches.
range patch just switchs using reconstructedValue to traversalValue in
range opclasses. reconstructedValue was used just because it was an
acceptable workaround in case of range type. Range opclase stores a full
value in leafs and doesn't need to use reconstructedValue to return
tuple in index only scan.
See http://www.postgresql.org/message-id/5399.1343512250@sss.pgh.pa.us
q4d patch implements index over boxes using SP-GiST. Basic idea was an
observation, range opclass thinks about one-dimensional ranges as 2D
points.
Following this idea, we can think that 2D box (what is 2 points or 4
numbers) could represent a 4D point. We hoped that this approach will be
much more effective than traditional R-Tree in case of many overlaps in
box's collection.
Performance results are coming shortly.
It appears that the issues raised in this thread have been addressed but the patch still has not gone though a real review.
Anybody out there willing to take a crack at a review? All three patches apply (with whitespace issues).
Emre Hasegeli will review the patch.
Thanks,
--
-David
david@pgmasters.net
Here are my first comments. I haven't read the actual index implementation, yet. I think traversal value is a useful addition. It is nice that the implementation for the range types is also changed. My questions about them are: Do reconstructedValues is required now? Wouldn't it be cleaner to use the new field on the prefix tree implementation, too? We haven't had specific memory context for reconstructedValues. Why is it required for the new field? > + Memory for <structfield>traversalValues</> should be allocated in > + the default context, but make sure to switch to > + <structfield>traversalMemoryContext</> before allocating memory > + for pointers themselves. This sentence is not understandable. Shouldn't it say "the memory should *not* be allocated in the default context"? What are the "pointers themselves"? The box opclass is broken because of the floating point arithmetics of the box type. You can reproduce it with the attached script. We really need to fix the geometric types, before building more on them. They fail to serve the only purpose they are useful for, demonstrating features. It think the opclass should support the cross data type operators like box @> point. Ideally we should do it by using multiple opclasses in an opfamily. The floating problem will bite us again in here, because some of the operators are not using FP macros. The tests check not supported operator "@". It should be "<@". It is nice that the opclass doesn't support long deprecated operators. We needs tests for the remaining operators and for adding new values to the index. I am not sure create_index.sql is the best place for them. Maybe we should use the test scripts of "box". > + #define LT -1 > + #define GT 1 > + #define EQ 0 Using these numbers is a very well established pattern. I don't think macros make the code any more readable. > + /* SP-GiST API functions */ > + Datum spg_box_quad_config(PG_FUNCTION_ARGS); > + Datum spg_box_quad_choose(PG_FUNCTION_ARGS); > + Datum spg_box_quad_picksplit(PG_FUNCTION_ARGS); > + Datum spg_box_quad_inner_consistent(PG_FUNCTION_ARGS); > + Datum spg_box_quad_leaf_consistent(PG_FUNCTION_ARGS); I guess they should go to the header file.
Attachment
> Do reconstructedValues is required now? Wouldn't it be cleaner to use > the new field on the prefix tree implementation, too? reconstructedValues is needed to reconctruct full indexed value and should match to type info indexed column > > We haven't had specific memory context for reconstructedValues. Why > is it required for the new field? because pg knows type of reconstructedValues (see above why) but traversalValue just a void*, SP-GiST core doesn't knnow how to free memory of allocated for it. > >> + Memory for <structfield>traversalValues</> should be allocated in >> + the default context, but make sure to switch to >> + <structfield>traversalMemoryContext</> before allocating memory >> + for pointers themselves. > > This sentence is not understandable. Shouldn't it say "the memory > should *not* be allocated in the default context"? What are the > "pointers themselves"? Clarified, I hope > > The box opclass is broken because of the floating point arithmetics of > the box type. You can reproduce it with the attached script. We hope, fixed. At least, seems, syncronized with seqscan > really need to fix the geometric types, before building more on them. > They fail to serve the only purpose they are useful for, demonstrating > features. agree, but it's not a deal for this patch > > It think the opclass should support the cross data type operators like > box @> point. Ideally we should do it by using multiple opclasses in > an opfamily. The floating problem will bite us again in here, because > some of the operators are not using FP macros. Again, agree. But I suggest to do it by separate patch. > > The tests check not supported operator "@". It should be "<@". It is > nice that the opclass doesn't support long deprecated operators. fixed >> + #define LT -1 >> + #define GT 1 >> + #define EQ 0 > > Using these numbers is a very well established pattern. I don't think > macros make the code any more readable. fixed > >> + /* SP-GiST API functions */ >> + Datum spg_box_quad_config(PG_FUNCTION_ARGS); >> + Datum spg_box_quad_choose(PG_FUNCTION_ARGS); >> + Datum spg_box_quad_picksplit(PG_FUNCTION_ARGS); >> + Datum spg_box_quad_inner_consistent(PG_FUNCTION_ARGS); >> + Datum spg_box_quad_leaf_consistent(PG_FUNCTION_ARGS); > > I guess they should go to the header file. Remove them because they are presented in auto-generated file ./src/include/utils/builtins.h range patch is unchanged, but still attached to keep them altogether. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
Here are my comments about the operator class implementation: > + * implementation of quad-4d tree over boxes for SP-GiST. Isn't the whole thing actually 3D? > + * For example consider the case of intersection. No need for a new line after this. > + * A quadrant has bounds, but sp-gist keeps only 4-d point (box) in inner nodes. I couldn't get the term 4D point. Maybe it means that we are using box datatype as the prefix, but we are not treating them as boxes. > + * We use traversalValue to calculate quadrant bounds from parent's quadrant > + * bounds. I couldn't understand this sentence. We are using the parent's prefix and the quadrant to generate the traversalValue. I think this is the crucial part of the opclass. It would be nice to explain it more clearly. > + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group 2016. > + * src/backend/utils/adt/boxtype_spgist.c Maybe we should name this file as geo_spgist.c to support other types in the future. > + #include "utils/builtins.h"; Extra ;. > + #include "utils/datum.h" I think this is unnecessary. > + /* InfR type implements doubles and +- infinity */ > + typedef struct > + { > + int infFlag; > + double val; > + } InfR; Why do we need this? Can't we store infinity in "double"? > + /* wrap double to InfR */ > + static InfR > + toInfR(double v, InfR * r) > + { > + r->infFlag = NotInf; > + r->val = v; > + } This isn't returning the value. > + typedef struct > + { > + Range range_x; > + Range range_y; > + } Rectangle; Wouldn't it be simpler to just using BOX instead of this? > + static void > + evalIRangeBox(const IRangeBox *range_box, const Range *range, const int half1, > + const int half2, IRangeBox *new_range_box) > > + static void > + evalIRectBox(const IRectBox *rect_box, const Rectangle *centroid, > + const uint8 quadrant, IRectBox * new_rect_box) > > + inline static void > + initializeUnboundedBox(IRectBox * rect_box) Wouldn't it be simpler to palloc and return the value on those functions? > + static int > + intersect2D(const Range * range, const IRangeBox * range_box) Wouldn't it be better to return "bool" on those functions. > + return ((p1 >= 0) && (p2 <= 0)); Extra parentheses. > + i const spgChooseIn *in = (spgChooseIn *) PG_GETARG_POINTER(0); > + i spgChooseOut *out = (spgChooseOut *) PG_GETARG_POINTER(1); Many variables are defined with "const". I am not sure they are all that doesn't change. If it applies to the pointer, "out" should also not change in here. Am I wrong? > + /* > + * Begin. This block evaluates the median of coordinates of boxes > + */ I would rather explain what the function does on the function header. > + memcpy(new_rect_box, rect_box, sizeof(IRectBox)); Do we really need to copy the traversalValues on allTheSame case. Wouldn't it work if just the same value is passed for all of them. The search shouldn't continue after allTheSame case. > + for (i = 0; flag && i < in->nkeys && flag; i++) It checks flag two times. > + boxPointerToRectangle( > + DatumGetBoxP(in->scankeys[i].sk_argument), > + p_query_rect > + ); I don't think this matches the project coding style. > + int flag = 1, Wouldn't it be better as "bool"? The regression tests for the remaining operators are still missing.
>> + * implementation of quad-4d tree over boxes for SP-GiST. > Isn't the whole thing actually 3D? No. The idea if this work is a representation of 2d box as 4d point. Quad means quadrant of 2d plane. Originally such kind of tree was developed for 2d point, and each node of tree splits plane on 4 quadrant. For 3d tree each node splits space for 8 "quadrants", 4d - 16 ones. >> + * For example consider the case of intersection. > No need for a new line after this. fixed >> + * A quadrant has bounds, but sp-gist keeps only 4-d point (box) in inner nodes. > I couldn't get the term 4D point. Maybe it means that we are using > box datatype as the prefix, but we are not treating them as boxes. exactly, we treat boxes as 4-dimentional points. > >> + * We use traversalValue to calculate quadrant bounds from parent's quadrant >> + * bounds. > I couldn't understand this sentence. We are using the parent's prefix > and the quadrant to generate the traversalValue. I think this is the > crucial part of the opclass. It would be nice to explain it more > clearly. I'll try to explain with two-dimensional example over points. ASCII-art: | | 1 | 2 | -----------+------------- |P 3 | 4 | | '+' with 'A' represents a centroid or, other words, point which splits plane for 4 quadrants and it strorend in parent node. 1,2,3,4 are labels of quadrants, each labeling will be the same for all pictures and all centriods, and i will not show them in pictures later to prevent too complicated images. Let we add C - child node (and again, it will split plane for 4 quads): | | ----+---------+--- X |B |C | | -----------+---------+--- |P |A | | | | A and B are points of intersection of lines. So, box PBCAis a bounding box for points contained in 3-rd (see labeling above). For example X labeled point is not a descendace of child node with centroid C because it must be in branch of 1-st quad of parent node. So, each node (except root) will have a limitation in its quadrant. To transfer that limitation the traversalValue is used. To keep formatting I attached a txt file with this fragment. > >> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group > 2016. fixed > >> + * src/backend/utils/adt/boxtype_spgist.c > Maybe we should name this file as geo_spgist.c to support other types > in the future. done >> + #include "utils/builtins.h"; > Extra ;. fixed > >> + #include "utils/datum.h" > I think this is unnecessary. removed > >> + /* InfR type implements doubles and +- infinity */ >> + typedef struct >> + { >> + int infFlag; >> + double val; >> + } InfR; > Why do we need this? Can't we store infinity in "double"? Hmm, you are right. fixed. > This isn't returning the value. removed > >> + typedef struct >> + { >> + Range range_x; >> + Range range_y; >> + } Rectangle; > > Wouldn't it be simpler to just using BOX instead of this? Too much the same names in RectBox > >> + static void >> + evalIRangeBox(const IRangeBox *range_box, const Range *range, const int half1, >> + const int half2, IRangeBox *new_range_box) >> >> + static void >> + evalIRectBox(const IRectBox *rect_box, const Rectangle *centroid, >> + const uint8 quadrant, IRectBox * new_rect_box) >> >> + inline static void >> + initializeUnboundedBox(IRectBox * rect_box) > > Wouldn't it be simpler to palloc and return the value on those functions? evalRangeBox() initializes part of RectBox, so, it could not palloc its result. Other methods use the same signature just for consistency. > >> + static int >> + intersect2D(const Range * range, const IRangeBox * range_box) > > Wouldn't it be better to return "bool" on those functions. done > >> + return ((p1 >= 0) && (p2 <= 0)); > Extra parentheses. removed > >> + i const spgChooseIn *in = (spgChooseIn *) PG_GETARG_POINTER(0); >> + i spgChooseOut *out = (spgChooseOut *) PG_GETARG_POINTER(1); > > Many variables are defined with "const". I am not sure they are all > that doesn't change. If it applies to the pointer, "out" should also > not change in here. Am I wrong? No, changes > >> + /* >> + * Begin. This block evaluates the median of coordinates of boxes >> + */ > I would rather explain what the function does on the function header. fixed > >> + memcpy(new_rect_box, rect_box, sizeof(IRectBox)); > Do we really need to copy the traversalValues on allTheSame case. > Wouldn't it work if just the same value is passed for all of them. > The search shouldn't continue after allTheSame case. Seems, yes. spgist tree could contain a locng branches with allTheSame. > >> + for (i = 0; flag && i < in->nkeys && flag; i++) > It checks flag two times. fixed > >> + boxPointerToRectangle( >> + DatumGetBoxP(in->scankeys[i].sk_argument), >> + p_query_rect >> + ); > I don't think this matches the project coding style. fixed > >> + int flag = 1, > Wouldn't it be better as "bool"? fixed. > The regression tests for the remaining operators are still missing. Ooops, will be soon. Attached all patches to simplify review. Thank you very much! -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >> I couldn't get the term 4D point. Maybe it means that we are >> using box datatype as the prefix, but we are not treating them >> as boxes. > > exactly, we treat boxes as 4-dimentional points. I'm not entirely sure I understand the terminology either, since I tend to think of a *point* as having *zero* dimensions. Would it perhaps be more accurate to say we are treating a 2-dimensional box as a point in 4-dimensional space? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On 21 Mar 2016, at 22:38, Kevin Grittner <kgrittn@gmail.com> wrote: > > On Mon, Mar 21, 2016 at 11:57 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: > >>> I couldn't get the term 4D point. Maybe it means that we are >>> using box datatype as the prefix, but we are not treating them >>> as boxes. >> >> exactly, we treat boxes as 4-dimentional points. > > I'm not entirely sure I understand the terminology either, since I > tend to think of a *point* as having *zero* dimensions. Would it > perhaps be more accurate to say we are treating a 2-dimensional box > as a point in 4-dimensional space? Or just say 4-d vector instead of 4-d point. Look like it will be enough rigorous. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 3/21/16 11:57 AM, Teodor Sigaev wrote: > A and B are points of intersection of lines. So, box PBCAis a bounding > box for points contained in 3-rd (see labeling above). For example X > labeled point is not a descendace of child node with centroid C because > it must be in branch of 1-st quad of parent node. So, each node (except > root) will have a limitation in its quadrant. To transfer that > limitation the traversalValue is used. Isn't this basically the same thing that the cube contrib module does? (Which has the added benefit of kNN-capable operators). If that's true then ISTM it'd be better to work on pulling cube's features into box? If it's not true, I'm still wondering if there's enough commonality here that we should pull cube into core... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
> On 22 Mar 2016, at 01:47, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote: > > On 3/21/16 11:57 AM, Teodor Sigaev wrote: >> A and B are points of intersection of lines. So, box PBCAis a bounding >> box for points contained in 3-rd (see labeling above). For example X >> labeled point is not a descendace of child node with centroid C because >> it must be in branch of 1-st quad of parent node. So, each node (except >> root) will have a limitation in its quadrant. To transfer that >> limitation the traversalValue is used. > > Isn't this basically the same thing that the cube contrib module does? (Which has the added benefit of kNN-capable operators). Cube and postgres own geometric types are indexed by building R-tree. While this is one of the most popular solutions it degrades almost to linear search when bounding boxes for each index node overlaps a lot. This problem will arise when onewill try to index streets in the city with grid system — a lot of street's bounding boxes will just coincide with bounding boxfor whole city. But that patch use totally different strategy for building index and do not suffer from above problem. > > If that's true then ISTM it'd be better to work on pulling cube's features into box? > > If it's not true, I'm still wondering if there's enough commonality here that we should pull cube into core… There is also intarray module with very common functionality, but it also addressing different data pattern. Optimal indexing and kNN strategy are very sensible to the data itself and if we want some general multidimensional searchroutines, then I think none of the mentioned extensions is general enough. Cube barely applicable when dimensions number is higherthan 10-20, intarray performs bad on data with big sets of possible coordinates, this patch is also intended to help with specific, nicheproblem. While people tends to use machine learning and regressions models more and more it is interesting to have some general n-dimindexing with kNN, but I think it is different problem and should be solved in a different way. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 3/21/16 7:41 PM, Stas Kelvich wrote: > While people tends to use machine learning and regressions models more and more it is interesting to have some generaln-dim indexing with kNN, > but I think it is different problem and should be solved in a different way. I think one of the issues here is it's not very clear to someone without a good amount of ML knowledge how these things relate. I hear "box' and 'cube' and think it's just a 2D vs 3D issue, and intarray isn't even on the radar. Maybe what's needed are actual vector and matrix types? In any case, if you've got a good reason why box and cube should stay separate then further discussion should happen in another thread. BTW, if you haven't seen it, take a look at http://madlib.apache.org/ -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
> tend to think of a *point* as having *zero* dimensions. Would it > perhaps be more accurate to say we are treating a 2-dimensional box > as a point in 4-dimensional space? Exactly, sorry for ambiguity. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Isn't this basically the same thing that the cube contrib module does? (Which > has the added benefit of kNN-capable operators). No, cube module introduces new type - N-dimensional box. And adds an index support for it. Current patch suggests non-traditional indexing technique for 2D boxes by treating them as point in 4D space. With such representation it's became possible to use quad-tree technique which: 1 supports only points to index 2 already supported by SP-GiST Such technique provides some benefits in comparance with traditional R-Tree which implemented in pg with a help GiST. See Oleg's message. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> + * boxtype_spgist.c The names on the file header need to be changed, too. > I'll try to explain with two-dimensional example over points. ASCII-art: > | > | > 1 | 2 > | > -----------+------------- > |P > 3 | 4 > | > | > > '+' with 'A' represents a centroid or, other words, point which splits plane > for 4 quadrants and it strorend in parent node. 1,2,3,4 are labels of > quadrants, each labeling will be the same for all pictures and all > centriods, and i will not show them in pictures later to prevent too > complicated images. Let we add C - child node (and again, it will split > plane for 4 quads): > > > | | > ----+---------+--- > X |B |C > | | > -----------+---------+--- > |P |A > | | > | > | > > A and B are points of intersection of lines. So, box PBCAis a bounding box > for points contained in 3-rd (see labeling above). For example X labeled > point is not a descendace of child node with centroid C because it must be > in branch of 1-st quad of parent node. So, each node (except root) will have > a limitation in its quadrant. To transfer that limitation the traversalValue > is used. > > To keep formatting I attached a txt file with this fragment. Thank you for the explanation. Should we incorporate this with the patch. >>> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development >>> Group >> >> 2016. > > fixed Not on the patch. > + cmp_double(const double a, const double b) Does this function necessary? We can just compare the doubles. > + boxPointerToRangeBox(BOX *box, RangeBox * rectangle) The "rectangle" variable in here should be renamed. > + Assert(is_infinite(b) == 0); This is failing on my test. You can reproduce with the script I have sent. >> Wouldn't it be simpler to palloc and return the value on those functions? > > evalRangeBox() initializes part of RectBox, so, it could not palloc its > result. > Other methods use the same signature just for consistency. I couldn't understand it. evalRangeBox() can palloc and return the result. evalRectBox() can return the result palloc'ed by evalRangeBox(). The caller wouldn't need to palloc anything. >> Many variables are defined with "const". I am not sure they are all >> that doesn't change. If it applies to the pointer, "out" should also >> not change in here. Am I wrong? > > No, changes I now read about "const". I am sorry for not being experienced in C. The new version of the patch marks them as "const" by mistake. I went through all other variables: > + int r = is_infinite(a); > > + double x = *(double *) a; > + double y = *(double *) b; > > + spgInnerConsistentIn *in = (spgInnerConsistentIn *) PG_GETARG_POINTER(0); > > + spgLeafConsistentIn *in = (spgLeafConsistentIn *) PG_GETARG_POINTER(0); > > + BOX *leafBox = DatumGetBoxP(in->leafDatum); Shouldn't they be "const", too? >>> + /* >>> + * Begin. This block evaluates the median of coordinates of boxes >>> + */ >> >> I would rather explain what the function does on the function header. > > fixed The "end" part of it is still there. >> Do we really need to copy the traversalValues on allTheSame case. >> Wouldn't it work if just the same value is passed for all of them. >> The search shouldn't continue after allTheSame case. > > Seems, yes. spgist tree could contain a locng branches with allTheSame. Does SP-GiST allows any node under allTheSame to not being allTheSame?Not setting traversalValues for allTheSame worked finewith my test. > + if (in->allTheSame) Most of the things happening before this check is not used in there. Shouldn't we move this to the top of the function? > + out->nodeNumbers = (int *) palloc(sizeof(int) * in->nNodes); This could go before allTheSame block.
>> + * boxtype_spgist.c > The names on the file header need to be changed, too. Oops. fixed >> I'll try to explain with two-dimensional example over points. ASCII-art: > Thank you for the explanation. Should we incorporate this with the patch. added >> + cmp_double(const double a, const double b) > Does this function necessary? We can just compare the doubles. Yes, it compares with limited precision as it does by geometry operations. Renamed to point actual arguments. > >> + boxPointerToRangeBox(BOX *box, RangeBox * rectangle) > The "rectangle" variable in here should be renamed. fixed > >> + Assert(is_infinite(b) == 0); > This is failing on my test. You can reproduce with the script I have sent. I didn't know: # select '(1,inf)'::point; point --------- (1,inf) fixed > >>> Wouldn't it be simpler to palloc and return the value on those functions? >> evalRangeBox() initializes part of RectBox, so, it could not palloc its >> result. >> Other methods use the same signature just for consistency. > > I couldn't understand it. evalRangeBox() can palloc and return the > result. evalRectBox() can return the result palloc'ed by > evalRangeBox(). The caller wouldn't need to palloc anything. evalRangeBox() is used to initialize fields of RangeBox in evalRectBox(). If evalRangeBox() will palloc its result then we need to copy its result into RangeBox struct and free. Let both fucntion use the same interface. > I went through all other variables: >> + int r = is_infinite(a); >> + double x = *(double *) a; >> + double y = *(double *) b; >> + spgInnerConsistentIn *in = (spgInnerConsistentIn *) PG_GETARG_POINTER(0); >> + spgLeafConsistentIn *in = (spgLeafConsistentIn *) PG_GETARG_POINTER(0); >> + BOX *leafBox = DatumGetBoxP(in->leafDatum); > Shouldn't they be "const", too? They could. But it doesn't required. To be in consistent state I've removed const modifier where possible > >>>> + /* >>>> + * Begin. This block evaluates the median of coordinates of boxes >>>> + */ >>> >>> I would rather explain what the function does on the function header. >> >> fixed > The "end" part of it is still there. Oops again, fixed > >>> Do we really need to copy the traversalValues on allTheSame case. >>> Wouldn't it work if just the same value is passed for all of them. >>> The search shouldn't continue after allTheSame case. >> >> Seems, yes. spgist tree could contain a locng branches with allTheSame. > > Does SP-GiST allows any node under allTheSame to not being allTheSame? No, SP-GiST doesn't allow that > Not setting traversalValues for allTheSame worked fine with my test. it works until allthesame branch contains only one inner node. Otherwise traversalValue will be freeed several times, see spgWalk(). # select i as id, '1,2,3,4'::box as b into x from generate_series(1,1000000) i; # create index ix on i using spgist (b); # select count(*) from x where b && '1,2,3,4'::box; -- coredump gdb: #0 0x000000080143564a in thr_kill () from /lib/libc.so.7 #1 0x0000000801435636 in raise () from /lib/libc.so.7 #2 0x00000008014355b9 in abort () from /lib/libc.so.7 #3 0x0000000000a80739 in in_error_recursion_trouble () at elog.c:199 #4 0x0000000000abb748 in pfree (pointer=0x801e90868) at mcxt.c:1016 #5 0x000000000053330c in freeScanStackEntry (so=0x801e8d358, stackEntry=0x801e935d8) at spgscan.c:47 #6 0x0000000000532cdb in spgWalk (index=0x801f1c588, so=0x801e8d358, scanWholeIndex=1 '\001', storeRes=0x532d10 <storeBitm ... > >> + if (in->allTheSame) > Most of the things happening before this check is not used in there. > Shouldn't we move this to the top of the function? yep, fixed > >> + out->nodeNumbers = (int *) palloc(sizeof(int) * in->nNodes); > This could go before allTheSame block. yep, fixed I've attached all patches again. Thank you very much! -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Attachment
>>> I'll try to explain with two-dimensional example over points. ASCII-art: >> >> Thank you for the explanation. Should we incorporate this with the patch. > > added I have worked on the comments of the patch. It is attached. I hope it looks more clear than it was before. >>> + cmp_double(const double a, const double b) >> >> Does this function necessary? We can just compare the doubles. > > Yes, it compares with limited precision as it does by geometry operations. > Renamed to point actual arguments. I meant that we could use FP macros directly instead of this function. Doing so is also more correct. I haven't tried to produce the problem, but this function is not same as using the macros directly. For differences smaller that the epsilon, it can return different results. I have removed it on the attached version. >>> + boxPointerToRangeBox(BOX *box, RangeBox * rectangle) >> >> The "rectangle" variable in here should be renamed. > > fixed I found a bunch of those too and renamed for clarity. I have also reordered the arguments of the helper functions to keep them consistent. > evalRangeBox() is used to initialize fields of RangeBox in evalRectBox(). If > evalRangeBox() will palloc its result then we need to copy its result into > RangeBox struct and free. Let both fucntion use the same interface. I found it nicer to copy and edit the existing value than creating it in two steps like this. It is also attached. > it works until allthesame branch contains only one inner node. Otherwise > traversalValue will be freeed several times, see spgWalk(). It just works, when traversalValues is not set. It is also attached. I have also added the missing regression tests. While doing that I noticed that some operators are missing and also added support for them. It is also attached with the updated version of my test script. I hope I haven't changed the patch too much. I have also pushed the changes here: https://github.com/hasegeli/postgres/commits/box-spgist
Attachment
Thank you, pushed Emre Hasegeli wrote: >>>> I'll try to explain with two-dimensional example over points. ASCII-art: >>> >>> Thank you for the explanation. Should we incorporate this with the patch. >> >> added > > I have worked on the comments of the patch. It is attached. I hope > it looks more clear than it was before. > >>>> + cmp_double(const double a, const double b) >>> >>> Does this function necessary? We can just compare the doubles. >> >> Yes, it compares with limited precision as it does by geometry operations. >> Renamed to point actual arguments. > > I meant that we could use FP macros directly instead of this function. > Doing so is also more correct. I haven't tried to produce the > problem, but this function is not same as using the macros directly. > For differences smaller that the epsilon, it can return different > results. I have removed it on the attached version. > >>>> + boxPointerToRangeBox(BOX *box, RangeBox * rectangle) >>> >>> The "rectangle" variable in here should be renamed. >> >> fixed > > I found a bunch of those too and renamed for clarity. I have also > reordered the arguments of the helper functions to keep them > consistent. > >> evalRangeBox() is used to initialize fields of RangeBox in evalRectBox(). If >> evalRangeBox() will palloc its result then we need to copy its result into >> RangeBox struct and free. Let both fucntion use the same interface. > > I found it nicer to copy and edit the existing value than creating it > in two steps like this. It is also attached. > >> it works until allthesame branch contains only one inner node. Otherwise >> traversalValue will be freeed several times, see spgWalk(). > > It just works, when traversalValues is not set. It is also attached. > > I have also added the missing regression tests. While doing that I > noticed that some operators are missing and also added support for > them. It is also attached with the updated version of my test script. > > I hope I haven't changed the patch too much. I have also pushed the > changes here: > > https://github.com/hasegeli/postgres/commits/box-spgist > -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Thank you, pushed Thank you for working on this. I noticed that have overlooked this: static RectBox * -initRectBox() +initRectBox(void){
On Thu, Mar 31, 2016 at 05:46:41PM +0200, Emre Hasegeli wrote: > > Thank you, pushed > > Thank you for working on this. > > I noticed that have overlooked this: > > static RectBox * > -initRectBox() > +initRectBox(void) > { Done, thanks. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +