Re: why is gist index taking so much space on the disc - Mailing list pgsql-hackers

From Grzegorz Jaskiewicz
Subject Re: why is gist index taking so much space on the disc
Date
Msg-id FCB2D3B8-9CCD-404E-8728-45CDE4DBE93F@pointblue.com.pl
Whole thread Raw
In response to why is gist index taking so much space on the disc  (Grzegorz Jaskiewicz <gj@pointblue.com.pl>)
Responses Re: why is gist index taking so much space on the disc
List pgsql-hackers
On 2005-11-21, at 19:32, Martijn van Oosterhout wrote:

> On Mon, Nov 21, 2005 at 04:58:25PM +0100, Grzegorz Jaskiewicz wrote:
>> my conquers with Gist index for custom type are nearly finished. It
>> is working as it is now, but there are few problems here and there.
>> One of em, being amount of disc space index it self takes. The type
>> stucture it self takes 160bytes. Adding 100.000 rows into table -
>> CREATE TABLE blah (a serial, b customType);
>
> Let's see, 160bytes means you'll get aboud 50 keys per page. So you
> would expect 2000 leaf page, 40 level 1 pages. This should be less  
> than
> 20-30MB
>
yep;

> You mean you sometimes put the same elements in the two halves? You
> shouldn't do that. The whole point is that the search will descend any
> node that matches consistant, but any single key should only appear
> once in each index.
>
> picksplit should *split* the set, not return two sets about the same
> size as you started...

Nope, I mean that 'masks' created to match either 'half' sometimes  
match elements in the other one.
This shouldn't be a big deal, just one level to go down on query to  
much more specific result set.
I have fixed that with, somewhat hack.
Here's some pseudo code

sort_data(input);
find_split_point(input);
mask1 = generate_two_masks(input[0]);
mask2 = generate_two_masks(input[1]);

foreach(input) {bool a = matches1(input);bool b = matches2(input);if ( a && b ) {    if ( left_index == 0 ) {
left[left_index++]= input;        }    else {        if ( right_index == 0 ) {            right[right_index++] = input;
          continue;        }
 
/* this part is new code, and helped a lot, now gist index takes much  
less space
and is much faster, because of lower I/O consumption*/        if ( loop_index % 2 ) {            right[right_index++] =
input;       }        else {            left[left_index++] = input;        }    }}else {    if ( a) left[left_index++]
=input;    if ( b) right[right_index++] = input;}
 
}


mask1 = generate(left );
mask2 = generate(right);

return (left, right, blah, blih, others);

Ok, so the part with i%2 helped a lot, it distributes elements  
matching both masks evenly.

Thanks guys.

I will play with k-means, and see if they will work better with no  
hacks. Either way, I have to have some code that will handle "matches  
both" case.

Thanks again.

-- 
GJ

"If we knew what we were doing, it wouldn't be called Research, would  
it?" - AE





pgsql-hackers by date:

Previous
From: James William Pye
Date:
Subject: Re: plpython and bytea
Next
From: "Kevin McArthur"
Date:
Subject: Re: why is gist index taking so much space on the disc