Home > mailing lists

Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: WIP: Fast GiST index build
Date	August 26, 2011 11:31:52
Msg-id	CAPpHfdtbZifApY+_W0OrxypUcJ6U1QPvhPe2HYsAwy30wNFNNA@mail.gmail.com Whole thread
In response to	Re: WIP: Fast GiST index build (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
List	pgsql-hackers

Tree view

On Thu, Aug 25, 2011 at 10:53 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

In the tests on the first version of patch I found index quality of regular
build much better than it of buffering build (without neighborrelocation).
Now it's similar, though it's because index quality of regular index build
become worse. There by in current tests regular index build is faster than
in previous. I see following possible causes of it:
1) I didn't save source random data. So, now it's a new random data.
2) Some environment parameters of my test setup may alters, though I doubt.
Despite these possible explanation it seems quite strange for me.

That's pretty surprising. Assuming the data is truly random, I wouldn't expect a big difference in the index quality of one random data set over another. If the index quality depends so much on, say, the distribution of the few first tuples that are inserted to it, that's a quite interesting find on its own, and merits some further research.

Yeah, it's pretty strange. Using same random datasets in different tests can help to exclude onepossible cause of difference.

In order to compare index build methods on more qualitative indexes, I've
tried to build indexes with my double sorting split method (see:
http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36). So
on uniform dataset search is faster in about 10 times! And, as it was
expected, regular index build becomes much slower. It runs more than 60
hours and while only 50% of index is complete (estimated by file sizes).

Also, automatic switching to buffering build shows better index quality
results in all the tests. While it's hard for me to explain that.

Hmm, makes me a bit uneasy that we're testing with a modified page splitting algorithm. But if the new algorithm is that good, could you post that as a separate patch, please?

I've post it in another message and I will try to get it into more appropriate form. Let me clarify this a little. I don't think my split algorithm is 10 times better than state of the art algorithms. I think that currently used new linear split shows unreasonably bad results in may cases. For example, uniformly distributed data is pretty easy case. And with almost any splitting algorithm we can get index with almost zero overlaps. But new linear split produces huge overlaps in this case. That's why I decided to make some experiments with another split algorithm.

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

From: Alexander Korotkov
Date: 26 August 2011, 11:19:12
Subject: Re: WIP: Fast GiST index build

From: Dave Page
Date: 26 August 2011, 11:36:35
Subject: Re: Questions and experiences writing a Foreign Data Wrapper

Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

Previous

Next