Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: Fast GiST index build
Date
Msg-id CAPpHfdtbZifApY+_W0OrxypUcJ6U1QPvhPe2HYsAwy30wNFNNA@mail.gmail.com
Whole thread Raw
In response to Re: WIP: Fast GiST index build  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List pgsql-hackers
On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

In the tests on the first version of patch I found index quality of regular
build much better than it of buffering build (without neighborrelocation).
Now it's similar, though it's because index quality of regular index build
become worse. There by in current tests regular index build is faster than
in previous. I see following possible causes of it:
 1) I didn't save source random data. So, now it's a new random data.
2) Some environment parameters of my test setup may alters, though I doubt.
Despite these possible explanation it seems quite strange for me.

That's pretty surprising. Assuming the data is truly random, I wouldn't expect a big difference in the index quality of one random data set over another. If the index quality depends so much on, say, the distribution of the few first tuples that are inserted to it, that's a quite interesting find on its own, and merits some further research.
Yeah, it's pretty strange. Using same random datasets in different tests can help to exclude onepossible cause of difference.
 
In order to compare index build methods on more qualitative indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
on uniform dataset search is faster in about 10 times! And, as it was
expected, regular index build becomes much slower. It runs more than 60
hours and while only 50% of index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests. While it's hard for me to explain that.

Hmm, makes me a bit uneasy that we're testing with a modified page splitting algorithm. But if the new algorithm is that good, could you post that as a separate patch, please?
I've post it in another message and I will try to get it into more appropriate form. Let me clarify this a little. I don't think my split algorithm is 10 times better than state of the art algorithms. I think that currently used new linear split shows unreasonably bad results in may cases. For example, uniformly distributed data is pretty easy case. And with almost any splitting algorithm we can get index with almost zero overlaps. But new linear split produces huge overlaps in this case. That's why I decided to make some experiments with another split algorithm. 

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: WIP: Fast GiST index build
Next
From: Dave Page
Date:
Subject: Re: Questions and experiences writing a Foreign Data Wrapper