Thread: GiST buffering build, bug in levelStep calculation
I just noticed a little bug in the way the levelStep parameter is calculated when we switch to buffering build: set client_min_messages='debug2'; This works: postgres=# set maintenance_work_mem='1GB'; SET postgres=# create index i_gisttest on gisttest using gist (t collate "C") WITH (fillfactor=10); DEBUG: building index "i_gisttest" on table "gisttest" DEBUG: switched to buffered GiST build; level step = 2, pagesPerBuffer = 232 But with higher maintenance_work_mem, it fails to switch to buffering mode: postgres=# set maintenance_work_mem='2GB'; SET postgres=# create index i_gisttest on gisttest using gist (t collate "C") WITH (fillfactor=10); DEBUG: building index "i_gisttest" on table "gisttest" DEBUG: failed to switch to buffered GiST build This is because of this rather complex calculation here: > while ( > /* subtree must fit in cache (with safety factor of 4) */ > (1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size/ 4 > && > /* each node in the lowest level of a subtree has one page in memory */ > (pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ) > ) What happens here is that the calculation of (maintenance_work_mem * 1024) / BLCKSZ overflows the 32 bit signed integer type used there, if maintenance_work_mem >= 2GB. I think we should be doing all these calculations in double. I'll go fix that.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > This is because of this rather complex calculation here: >> while ( >> /* subtree must fit in cache (with safety factor of 4) */ >> (1 - pow(avgIndexTuplesPerPage, (double) (levelStep + 1))) / (1 - avgIndexTuplesPerPage) < effective_cache_size / 4 >> && >> /* each node in the lowest level of a subtree has one page in memory */ >> (pow(maxIndexTuplesPerPage, (double) levelStep) < (maintenance_work_mem * 1024) / BLCKSZ) >> ) > What happens here is that the calculation of (maintenance_work_mem * > 1024) / BLCKSZ overflows the 32 bit signed integer type used there, if > maintenance_work_mem >= 2GB. I think we should be doing all these > calculations in double. Given the use of pow(), I concur with changing the whole calculation to double. Just as a remark, the correct way to code that sort of thing normally is (maintenance_work_mem * 1024L) / BLCKSZ, which avoids overflow because guc.c limits MAX_KILOBYTES to ensure it won't overflow a long. (Given that we are now supporting Win64 where long is narrower than size_t, we might want to revisit this coding rule eventually, but it's not high on my priority list.) While I'm looking at this, is the first test involving effective_cache_size bulletproof either? In particular, is avgIndexTuplesPerPage clamped to be strictly greater than 1? And for that matter, is either test sane from a units standpoint? It seems to me that maxIndexTuplesPerPage would have units of 1/blocks, which is pretty dubious to be comparing to a block count even disregarding the power function. regards, tom lane
On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It's based on collected statistics on already inserted tuple sizes. Since tuple sizes are measured after possible toasting, I don't see the way for avgIndexTuplesPerPage to be less than 1.
------
With best regards,
Alexander Korotkov.
While I'm looking at this, is the first test involving
effective_cache_size bulletproof either? In particular, is
avgIndexTuplesPerPage clamped to be strictly greater than 1?
And for that matter, is either test sane from a units standpoint?
It seems to me that maxIndexTuplesPerPage would have units of
1/blocks, which is pretty dubious to be comparing to a block count
even disregarding the power function.
In this test we use avgIndexTuplesPerPage as estimate for number of child index pages of one page. I think we can assume it to have blocks unit.
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> While I'm looking at this, is the first test involving >> effective_cache_size bulletproof either? In particular, is >> avgIndexTuplesPerPage clamped to be strictly greater than 1? > It's based on collected statistics on already inserted tuple sizes. Since > tuple sizes are measured after possible toasting, I don't see the way > for avgIndexTuplesPerPage to be less than 1. Yeah, but if it could be *equal* to one, you've got a zero-divide there. regards, tom lane
On Wed, May 30, 2012 at 12:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
------
With best regards,
Alexander Korotkov.
Alexander Korotkov <aekorotkov@gmail.com> writes:Yeah, but if it could be *equal* to one, you've got a zero-divide there.
> On Tue, May 29, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While I'm looking at this, is the first test involving
>> effective_cache_size bulletproof either? In particular, is
>> avgIndexTuplesPerPage clamped to be strictly greater than 1?
> It's based on collected statistics on already inserted tuple sizes. Since
> tuple sizes are measured after possible toasting, I don't see the way
> for avgIndexTuplesPerPage to be less than 1.
avgIndexTuplesPerPage is calculated as:
avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize;
I think size of each index tuple must be at least few times lower than pageFreeSpace to let us create any index.
With best regards,
Alexander Korotkov.
On 29.05.2012 23:46, Alexander Korotkov wrote: > On Wed, May 30, 2012 at 12:25 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote: > >> Alexander Korotkov<aekorotkov@gmail.com> writes: >>> On Tue, May 29, 2012 at 11:42 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote: >>>> While I'm looking at this, is the first test involving >>>> effective_cache_size bulletproof either? In particular, is >>>> avgIndexTuplesPerPage clamped to be strictly greater than 1? >> >>> It's based on collected statistics on already inserted tuple sizes. Since >>> tuple sizes are measured after possible toasting, I don't see the way >>> for avgIndexTuplesPerPage to be less than 1. >> >> Yeah, but if it could be *equal* to one, you've got a zero-divide there. >> > > avgIndexTuplesPerPage is calculated as: > > avgIndexTuplesPerPage = pageFreeSpace / itupAvgSize; > > I think size of each index tuple must be at least few times lower > than pageFreeSpace to let us create any index. Hmm, in theory, it seems possible that every leaf level index tuple would completely fill an index page. Not sure how useful such an index would be, though. On internal pages, at least, you have to fit at least two tuples on a page or you can't build a tree. I note that the calculations assume that leaf tuples and internal tuples have similar sizes. We calculate the average leaf tuple size, and use that to calculate the fan-out of internal pages. On some GiST opclasses, the values stored on internal pages might be quite different from the leaf tuples. I don't think we need to worry about that in practice, these calculations are not very accurate anyway, but perhaps a comment would be in order. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, May 31, 2012 at 1:36 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
I note that the calculations assume that leaf tuples and internal tuples have similar sizes. We calculate the average leaf tuple size, and use that to calculate the fan-out of internal pages. On some GiST opclasses, the values stored on internal pages might be quite different from the leaf tuples. I don't think we need to worry about that in practice, these calculations are not very accurate anyway, but perhaps a comment would be in order.
Probably we could collect per-level statistics of average tuple size?
With best regards,
Alexander Korotkov.