Re: WIP: Fast GiST index build - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: Fast GiST index build
Date
Msg-id CAPpHfdvoOh78ycX3eRePTS29635pHCSTfLdFDzHZhxTKsggCuQ@mail.gmail.com
Whole thread Raw
In response to Re: WIP: Fast GiST index build  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: Fast GiST index build
List pgsql-hackers
Hi!

Patch with my try to detect ordered datasets is attached. The implemented idea is desribed below.
Index tuples are divided by chunks of 128. On each chunk we measure how much leaf pages where index tuples was inserted don't match those of previous chunk. Based on statistics of several chunks we estimate distribution of accesses between lead pages (exponential distribution law is accumed and it's seems to be an error). After that we can estimate portion of index tuples which can be processed without actual IO. If this estimate exceeds threshold then we should switch to buffering build.
Now my implementation successfully detects randomly mixed datasets and well ordered datasets, but it's seems to be too optimistic about intermediate cases. I believe it's due to wrong assumption about distribution law.
Do you think this approach is acceptable? Probably there are some researches about distribution law for such cases (while I didn't find anything relevant in google scholar)?
As an alternative I can propose take into account actual average IO operations per tuple rather then an estimate.

------
With best regards,
Alexander Korotkov.

On Mon, Jul 18, 2011 at 10:00 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Hi!

New version of patch is attached. There are following changes.
1) Since proposed tchnique is not always a "fast" build, it was renamed everywhere in the patch to "buffering" build.
2) Parameter "buffering" now has 3 possible values "yes", "no" and "auto". "auto" means automatic switching from regular index build to buffering one. Currently it just switch when index size exceeds maintenance_work_mem.
3) Holding of many buffers pinned is avoided.
4) Rebased with head.

TODO:
1) Take care about ordered datasets in automatic switching.
2) Take care about concurrent backends in automatic switching.

------
With best regards,
Alexander Korotkov.

Attachment

pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: Questions and experiences writing a Foreign Data Wrapper
Next
From: Kohei Kaigai
Date:
Subject: Re: [v9.1] sepgsql - userspace access vector cache