Thread: GiST buffering build, bug in levelStep calculation

GiST buffering build, bug in levelStep calculation

From

Heikki Linnakangas

Date:

29 May 2012, 16:14:01

I just noticed a little bug in the way the levelStep parameter is 
calculated when we switch to buffering build:

set client_min_messages='debug2';

This works:

postgres=# set maintenance_work_mem='1GB';
SET
postgres=# create index i_gisttest on gisttest using gist (t collate 
"C") WITH (fillfactor=10);
DEBUG:  building index "i_gisttest" on table "gisttest"
DEBUG:  switched to buffered GiST build; level step = 2, pagesPerBuffer 
= 232

But with higher maintenance_work_mem, it fails to switch to buffering mode:

postgres=# set maintenance_work_mem='2GB';
SET
postgres=# create index i_gisttest on gisttest using gist (t collate 
"C") WITH (fillfactor=10);
DEBUG:  building index "i_gisttest" on table "gisttest"
DEBUG:  failed to switch to buffered GiST build

This is because of this rather complex calculation here:

>     while (
>     /* subtree must fit in cache (with safety factor of 4) */
>            (1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) <
effective_cache_size/ 4
 
>            &&
>     /* each node in the lowest level of a subtree has one page in memory */
>            (pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ)
>         )

What happens here is that the calculation of (maintenance_work_mem * 
1024) / BLCKSZ overflows the 32 bit signed integer type used there, if 
maintenance_work_mem >= 2GB. I think we should be doing all these 
calculations in double.

I'll go fix that..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: GiST buffering build, bug in levelStep calculation

From

Tom Lane

Date:

29 May 2012, 16:42:49

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> This is because of this rather complex calculation here:

>> while (
>> /* subtree must fit in cache (with safety factor of 4) */
>> (1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size / 4
>> &&
>> /* each node in the lowest level of a subtree has one page in memory */
>> (pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ)
>> )

> What happens here is that the calculation of (maintenance_work_mem * 
> 1024) / BLCKSZ overflows the 32 bit signed integer type used there, if 
> maintenance_work_mem >= 2GB. I think we should be doing all these 
> calculations in double.

Given the use of pow(), I concur with changing the whole calculation to
double.  Just as a remark, the correct way to code that sort of thing
normally is (maintenance_work_mem * 1024L) / BLCKSZ, which avoids
overflow because guc.c limits MAX_KILOBYTES to ensure it won't overflow
a long.  (Given that we are now supporting Win64 where long is narrower
than size_t, we might want to revisit this coding rule eventually, but
it's not high on my priority list.)

While I'm looking at this, is the first test involving
effective_cache_size bulletproof either?  In particular, is
avgIndexTuplesPerPage clamped to be strictly greater than 1?

And for that matter, is either test sane from a units standpoint?
It seems to me that maxIndexTuplesPerPage would have units of
1/blocks, which is pretty dubious to be comparing to a block count
even disregarding the power function.
        regards, tom lane

Re: GiST buffering build, bug in levelStep calculation

From

Alexander Korotkov

Date:

29 May 2012, 17:21:25

On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

While I'm looking at this, is the first test involving
effective_cache_size bulletproof either? In particular, is
avgIndexTuplesPerPage clamped to be strictly greater than 1?

It's based on collected statistics on already inserted tuple sizes. Since tuple sizes are measured after possible toasting, I don't see the way for avgIndexTuplesPerPage to be less than 1.

And for that matter, is either test sane from a units standpoint?
It seems to me that maxIndexTuplesPerPage would have units of
1/blocks, which is pretty dubious to be comparing to a block count
even disregarding the power function.

In this test we use avgIndexTuplesPerPage as estimate for number of child index pages of one page. I think we can assume it to have blocks unit.

------
With best regards,
Alexander Korotkov.

Re: GiST buffering build, bug in levelStep calculation

From

Tom Lane

Date:

29 May 2012, 17:25:22

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While I'm looking at this, is the first test involving
>> effective_cache_size bulletproof either?  In particular, is
>> avgIndexTuplesPerPage clamped to be strictly greater than 1?

> It's based on collected statistics on already inserted tuple sizes. Since
> tuple sizes are measured after possible toasting, I don't see the way
> for avgIndexTuplesPerPage to be less than 1.

Yeah, but if it could be *equal* to one, you've got a zero-divide there.
        regards, tom lane

Re: GiST buffering build, bug in levelStep calculation

From

Alexander Korotkov

Date:

29 May 2012, 17:46:48

On Wed, May 30, 2012 at 12:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While I'm looking at this, is the first test involving
>> effective_cache_size bulletproof either? In particular, is
>> avgIndexTuplesPerPage clamped to be strictly greater than 1?

> It's based on collected statistics on already inserted tuple sizes. Since
> tuple sizes are measured after possible toasting, I don't see the way
> for avgIndexTuplesPerPage to be less than 1.

Yeah, but if it could be *equal* to one, you've got a zero-divide there.

avgIndexTuplesPerPage is calculated as:

avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;

I think size of each index tuple must be at least few times lower than pageFreeSpace to let us create any index.

------
With best regards,
Alexander Korotkov.

Re: GiST buffering build, bug in levelStep calculation

From

Heikki Linnakangas

Date:

31 May 2012, 06:36:38

On 29.05.2012 23:46, Alexander Korotkov wrote:
> On Wed, May 30, 2012 at 12:25 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>
>> Alexander Korotkov<aekorotkov@gmail.com>  writes:
>>> On Tue, May 29, 2012 at 11:42 PM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>>> While I'm looking at this, is the first test involving
>>>> effective_cache_size bulletproof either?  In particular, is
>>>> avgIndexTuplesPerPage clamped to be strictly greater than 1?
>>
>>> It's based on collected statistics on already inserted tuple sizes. Since
>>> tuple sizes are measured after possible toasting, I don't see the way
>>> for avgIndexTuplesPerPage to be less than 1.
>>
>> Yeah, but if it could be *equal* to one, you've got a zero-divide there.
>>
>
> avgIndexTuplesPerPage is calculated as:
>
> avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
>
> I think size of each index tuple must be at least few times lower
> than pageFreeSpace to let us create any index.

Hmm, in theory, it seems possible that every leaf level index tuple 
would completely fill an index page. Not sure how useful such an index 
would be, though. On internal pages, at least, you have to fit at least 
two tuples on a page or you can't build a tree.

I note that the calculations assume that leaf tuples and internal tuples 
have similar sizes. We calculate the average leaf tuple size, and use 
that to calculate the fan-out of internal pages. On some GiST opclasses, 
the values stored on internal pages might be quite different from the 
leaf tuples. I don't think we need to worry about that in practice, 
these calculations are not very accurate anyway, but perhaps a comment 
would be in order.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: GiST buffering build, bug in levelStep calculation

From

Alexander Korotkov

Date:

31 May 2012, 07:35:30

On Thu, May 31, 2012 at 1:36 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

I note that the calculations assume that leaf tuples and internal tuples have similar sizes. We calculate the average leaf tuple size, and use that to calculate the fan-out of internal pages. On some GiST opclasses, the values stored on internal pages might be quite different from the leaf tuples. I don't think we need to worry about that in practice, these calculations are not very accurate anyway, but perhaps a comment would be in order.

Probably we could collect per-level statistics of average tuple size?

------
With best regards,
Alexander Korotkov.