Home > mailing lists

Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

From	Alexander Korotkov
Subject	Re: WIP: Fast GiST index build
Date	July 22, 2011 09:39:03
Msg-id	CAPpHfdvoOh78ycX3eRePTS29635pHCSTfLdFDzHZhxTKsggCuQ@mail.gmail.com Whole thread Raw
In response to	Re: WIP: Fast GiST index build (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: WIP: Fast GiST index build
List	pgsql-hackers

Tree view

Hi!

Patch with my try to detect ordered datasets is attached. The implemented idea is desribed below.

Index tuples are divided by chunks of 128. On each chunk we measure how much leaf pages where index tuples was inserted don't match those of previous chunk. Based on statistics of several chunks we estimate distribution of accesses between lead pages (exponential distribution law is accumed and it's seems to be an error). After that we can estimate portion of index tuples which can be processed without actual IO. If this estimate exceeds threshold then we should switch to buffering build.

Now my implementation successfully detects randomly mixed datasets and well ordered datasets, but it's seems to be too optimistic about intermediate cases. I believe it's due to wrong assumption about distribution law.

Do you think this approach is acceptable? Probably there are some researches about distribution law for such cases (while I didn't find anything relevant in google scholar)?

As an alternative I can propose take into account actual average IO operations per tuple rather then an estimate.

------
With best regards,
Alexander Korotkov.

On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:

Hi!

New version of patch is attached. There are following changes.
1) Since proposed tchnique is not always a "fast" build, it was renamed everywhere in the patch to "buffering" build.
2) Parameter "buffering" now has 3 possible values "yes", "no" and "auto". "auto" means automatic switching from regular index build to buffering one. Currently it just switch when index size exceeds maintenance_work_mem.
3) Holding of many buffers pinned is avoided.
4) Rebased with head.

TODO:
1) Take care about ordered datasets in automatic switching.
2) Take care about concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.

Attachment

gist_fast_build-0.8.0.patch.gz

pgsql-hackers by date:

From: Heikki Linnakangas
Date: 22 July 2011, 09:29:56
Subject: Re: Questions and experiences writing a Foreign Data Wrapper

From: Kohei Kaigai
Date: 22 July 2011, 09:56:19
Subject: Re: [v9.1] sepgsql - userspace access vector cache

Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

Attachment

Previous

Next