Thread: GiST: PickSplit and multi-attr indexes
Oleg & Teodor, If I understand the code correctly, GiST will only pass the first attribute of each index tuple to the user-defined PickSplit method when it wants to split a node. (see circa line 1269 of gist.c) Is this a wise design decision? Granted, in many situations the first attribute in the index is sufficient to make a reasonable decision about how to divide the node into two halves, but I don't think that is universally true. For example, consider a two column index whose first attribute has a small number of distinct values. It could well be that all the first attribute values in a node-to-be-split would be the same. Only passing the first attribute to PickSplit would result in an essentially random distribution of tuples among the split nodes, rather than allowing the GiST extension to make use of the second attribution to partition the nodes. That's an extreme example, but it is easy to construct more realistic scenarios (basically, any situation in which the cardinality of the first index attribute is low -- a reasonably common occurrence with a multi-column index, I believe). I'm not sure whether this would be a problem in practice. Speculation: repeated invocations of PickSplit are one of the main factors in deriving the ultimate shape of the GiST tree. Poor distribution of keys resulting from PickSplit would eventually result in unnecessarily loose bounding predicates in internal nodes, which would hurt performance. Comments welcome, Neil
Neil Conway <neilc@samurai.com> writes: > If I understand the code correctly, GiST will only pass the first > attribute of each index tuple to the user-defined PickSplit method when > it wants to split a node. (see circa line 1269 of gist.c) > Is this a wise design decision? It's probably just a hangover from the days when GiST didn't support multi-column indexes at all. I agree it should be changed. But note you will then have to cope with NULL values. regards, tom lane
On Sun, 2004-11-14 at 18:54 -0500, Tom Lane wrote: > It's probably just a hangover from the days when GiST didn't support > multi-column indexes at all. I agree it should be changed. I'm not sure the right way to fix it (at least without significant changes to the GiST API). At present, the PickSplit() method is passed a vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs correspond to the first attributes of all the tuples in the node to be split. There is no provision for the GiST extension to be informed about any additional attributes in the index tuples. Even if we changed the API to allow that, the GiST extension would have a hard time making a reasonable decision in the multi-attribute case: the additional index attributes could well be implemented using a different GiST extension. > But note you will then have to cope with NULL values. Yes -- I'm not sure offhand why GiST does not allow leading NULL values in index attributes, but that ought to be fixed anyway. -Neil
Neil, I think we should not touch this right now unless you certainly know algorithm proven to be good in general case. Did you already got some idea ? Oleg On Mon, 15 Nov 2004, Neil Conway wrote: > On Sun, 2004-11-14 at 18:54 -0500, Tom Lane wrote: >> It's probably just a hangover from the days when GiST didn't support >> multi-column indexes at all. I agree it should be changed. > > I'm not sure the right way to fix it (at least without significant > changes to the GiST API). At present, the PickSplit() method is passed a > vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs > correspond to the first attributes of all the tuples in the node to be > split. There is no provision for the GiST extension to be informed about > any additional attributes in the index tuples. Even if we changed the > API to allow that, the GiST extension would have a hard time making a > reasonable decision in the multi-attribute case: the additional index > attributes could well be implemented using a different GiST extension. > >> But note you will then have to cope with NULL values. > > Yes -- I'm not sure offhand why GiST does not allow leading NULL values > in index attributes, but that ought to be fixed anyway. > > -Neil > > > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Neil Conway <neilc@samurai.com> writes: > I'm not sure the right way to fix it (at least without significant > changes to the GiST API). At present, the PickSplit() method is passed a > vector of GISTENTRYs and fills in a GIST_SPLITVEC. The GISTENTRYs > correspond to the first attributes of all the tuples in the node to be > split. There is no provision for the GiST extension to be informed about > any additional attributes in the index tuples. Even if we changed the > API to allow that, the GiST extension would have a hard time making a > reasonable decision in the multi-attribute case: the additional index > attributes could well be implemented using a different GiST extension. Right, when you consider attributes of different types it gets tough. I'm not familiar with the details of the GiST code, but would it work to generalize PickSplit to return a three-way classification? That is, instead of actually splitting the node, have it identify each item as "definitely left", "definitely right", or "don't care" (think "less", "greater", or "equal to" the desired split value). If there is another index column, you then invoke its PickSplit on just the "don't care" set to try to classify them; and so on. Once you are out of columns, any remaining "don't care" set is split using some standardized method like trying to equalize the physical space. Of course this requires API changes for PickSplit (not only a more complex result, but it will have to deal with null inputs). But it seems like it would be localized otherwise. Also it might allow removal of duplicated logic to deal with splitting equal values sanely. regards, tom lane
On Mon, 2004-11-15 at 10:19 -0500, Tom Lane wrote: > I'm not familiar with the details of the GiST code, but would it work to > generalize PickSplit to return a three-way classification? That is, > instead of actually splitting the node, have it identify each item as > "definitely left", "definitely right", or "don't care" (think "less", > "greater", or "equal to" the desired split value). I had thought about this solution, but I was worried that it will actually make GiST a less general framework, because it requires that the PickSplit() method effectively order the leaves of the tree. For btree that is a reasonable assumption to make, but I'm not sure if it can or should be made in general. That said, I haven't actually implemented any indexes using GiST; perhaps Oleg or Teodor can comment? -Neil
Neil Conway <neilc@samurai.com> writes: > On Mon, 2004-11-15 at 10:19 -0500, Tom Lane wrote: >> I'm not familiar with the details of the GiST code, but would it work to >> generalize PickSplit to return a three-way classification? That is, >> instead of actually splitting the node, have it identify each item as >> "definitely left", "definitely right", or "don't care" (think "less", >> "greater", or "equal to" the desired split value). > I had thought about this solution, but I was worried that it will > actually make GiST a less general framework, because it requires that > the PickSplit() method effectively order the leaves of the tree. For > btree that is a reasonable assumption to make, but I'm not sure if it > can or should be made in general. If there are no don't-care cases, then you're effectively saying that the first column's PickSplit has sole control over the tree shape, which is where we're at now. ISTM the entire point of a multi-column index is that the first column has duplicates, or at least values that are similar enough to qualify as don't-cares. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > If there are no don't-care cases, then you're effectively saying that > the first column's PickSplit has sole control over the tree shape, > which is where we're at now. ISTM the entire point of a multi-column > index is that the first column has duplicates, or at least values that > are similar enough to qualify as don't-cares. I'm not sure that GiST indexes behave the same way as btree indexes for the multi-column case. In a btree index the second column is entirely subordinate to the first column. In a GiST index the data is multi-dimensional, and all dimensions are equally important. For a concrete example, say you wanted to index a 2d cartesian coordinate. If you stored it in a single column using the point data type (or a 2 element array) then the PickSplit function can use both coordinates to determine its split. A GiST index on two separate integer columns really shouldn't be prevented from operating in the same way. -- greg
Greg Stark <gsstark@MIT.EDU> writes: > I'm not sure that GiST indexes behave the same way as btree indexes for the > multi-column case. > > In a btree index the second column is entirely subordinate to the first > column. In a GiST index the data is multi-dimensional, and all dimensions are > equally important. In fact on further consideration I do have a proposal. If you look at the GiST implementations for various index types you'll see that many (all?) take the same approach for PickSplit. In fact they pretty much all have the same code copy/pasted to handle it. The approach they take is to have a function which calculates an abstract "distance" between any two entries. There's an algorithm that they use to pick the split based on this distance function. If you abandoned "PickSplit" and instead exposed this distance function as the external API then the behaviour for multi-column indexes is clear. You calculate the distance along all the axes and calculate the diagonal distance. I think abandoning PickSplit for the distance function might also mean you don't need a separate function for Penalty either, but I'm not sure on that. -- greg
Greg Stark <gsstark@MIT.EDU> writes: > The approach they take is to have a function which calculates an > abstract "distance" between any two entries. There's an algorithm that > they use to pick the split based on this distance function. > If you abandoned "PickSplit" and instead exposed this distance > function as the external API then the behaviour for multi-column > indexes is clear. You calculate the distance along all the axes and > calculate the diagonal distance. Hmm ... the problem with that is the assumption that different opclasses will compute similarly-scaled distances. If opclass A generates distances in the range (0,1e6) while B generates in the range (0,1), combining them with Euclidean distance won't work well at all. OTOH you can't blindly normalize, because in some cases maybe the data is such that a massive difference in distances is truly appropriate. I'm also a bit leery of the assumption that every GiST application can reduce its PickSplit logic to Euclidean distances. regards, tom lane