On 2005-11-21, at 19:32, Martijn van Oosterhout wrote:
> On Mon, Nov 21, 2005 at 04:58:25PM +0100, Grzegorz Jaskiewicz wrote:
>> my conquers with Gist index for custom type are nearly finished. It
>> is working as it is now, but there are few problems here and there.
>> One of em, being amount of disc space index it self takes. The type
>> stucture it self takes 160bytes. Adding 100.000 rows into table -
>> CREATE TABLE blah (a serial, b customType);
>
> Let's see, 160bytes means you'll get aboud 50 keys per page. So you
> would expect 2000 leaf page, 40 level 1 pages. This should be less
> than
> 20-30MB
>
yep;
> You mean you sometimes put the same elements in the two halves? You
> shouldn't do that. The whole point is that the search will descend any
> node that matches consistant, but any single key should only appear
> once in each index.
>
> picksplit should *split* the set, not return two sets about the same
> size as you started...
Nope, I mean that 'masks' created to match either 'half' sometimes
match elements in the other one.
This shouldn't be a big deal, just one level to go down on query to
much more specific result set.
I have fixed that with, somewhat hack.
Here's some pseudo code
sort_data(input);
find_split_point(input);
mask1 = generate_two_masks(input[0]);
mask2 = generate_two_masks(input[1]);
foreach(input) {bool a = matches1(input);bool b = matches2(input);if ( a && b ) { if ( left_index == 0 ) {
left[left_index++]= input; } else { if ( right_index == 0 ) { right[right_index++] = input;
continue; }
/* this part is new code, and helped a lot, now gist index takes much
less space
and is much faster, because of lower I/O consumption*/ if ( loop_index % 2 ) { right[right_index++] =
input; } else { left[left_index++] = input; } }}else { if ( a) left[left_index++]
=input; if ( b) right[right_index++] = input;}
}
mask1 = generate(left );
mask2 = generate(right);
return (left, right, blah, blih, others);
Ok, so the part with i%2 helped a lot, it distributes elements
matching both masks evenly.
Thanks guys.
I will play with k-means, and see if they will work better with no
hacks. Either way, I have to have some code that will handle "matches
both" case.
Thanks again.
--
GJ
"If we knew what we were doing, it wouldn't be called Research, would
it?" - AE