Thread: Should heapam_estimate_rel_size consider fillfactor?

Should heapam_estimate_rel_size consider fillfactor?

From: Tomas Vondra

Hi,

While testing some stuff, I noticed heapam_estimate_rel_size (or rather
table_block_relation_estimate_size, which it calls) ignores fillfactor, so that
for a table without statistics it ends up with a reltuples estimate much
higher than reality. For example, with fillfactor=10 the estimate is
about 10x higher.

I ran into this while doing some tests with hash indexes, where I use
fillfactor to make the table look bigger (as if the tuples were wider),
and I ran into this:

  drop table hash_test;
  create table hash_test (a int, b text) with (fillfactor=10);
  insert into hash_test select 1 + ((i - 1) / 10000), md5(i::text)
         from generate_series(1, 1000000) s(i);
  -- analyze hash_test;
  create index hash_test_idx on hash_test using hash (a);

  select pg_size_pretty(pg_relation_size('hash_test_idx'));

If you run it like this (without the analyze), the index will be 339MB.
With the analyze, it's 47MB.

This only happens if there are no relpages/reltuples statistics yet, in
which case table_block_relation_estimate_size estimates density from
tuple width etc.

So it seems the easiest "fix" is to do ANALYZE before creating the index
(and not after it, as I had in my scripts). But I wonder how many people
fail to realize this - it sure took me a while to notice the indexes
were too large, and even longer to figure out what was causing it. I wouldn't be very
surprised if many people had bloated hash indexes after bulk loads.

So maybe we should make table_block_relation_estimate_size smarter to
also consider the fillfactor in the "no statistics" branch, per the
attached patch.
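
To make this concrete, here's a toy standalone program (made-up numbers
and simplified names, not the actual table_block_relation_estimate_size
code and not the patch itself) showing the effect on the estimated tuple
density:

  #include <stdio.h>

  int
  main(void)
  {
      int usable_bytes_per_page = 8168; /* roughly BLCKSZ minus page header */
      int tuple_width = 64;             /* estimated row width incl. header */
      int fillfactor = 10;              /* the reloption, in percent */

      /* current no-statistics estimate: assume the usable page is full */
      int density_current = usable_bytes_per_page / tuple_width;

      /* proposed: only fillfactor% of each page actually holds tuples */
      int density_proposed =
          (usable_bytes_per_page * fillfactor / 100) / tuple_width;

      printf("rows/page ignoring fillfactor: %d\n", density_current);   /* 127 */
      printf("rows/page with fillfactor=10:  %d\n", density_proposed);  /* 12 */

      return 0;
  }

With fillfactor=10 the no-statistics path thinks ~127 rows fit on each
page while only ~12 really do, which matches the roughly 10x overestimate
described above.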

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Should heapam_estimate_rel_size consider fillfactor?

From: Corey Huinker

> So maybe we should make table_block_relation_estimate_size smarter to
> also consider the fillfactor in the "no statistics" branch, per the
> attached patch.

I like this a lot. The reasoning is obvious, the fix is simple, it doesn't upset any make-check-world tests, and in order to get a performance regression we'd need a table whose fillfactor has been changed after the data was loaded but before an analyze happens, and that's a narrow enough case to accept.

My only nitpick is to swap
(usable_bytes_per_page * fillfactor / 100) / tuple_width
with
(usable_bytes_per_page * fillfactor) / (tuple_width * 100)

as this will eliminate the extra remainder truncation, and it also gets the arguments "in order" algebraically.


Re: Should heapam_estimate_rel_size consider fillfactor?

From: Andres Freund

Hi,

On 2023-06-11 14:41:27 +0200, Tomas Vondra wrote:
> While testing some stuff, I noticed heapam_estimate_rel_size (or rather
> table_block_relation_estimate_size, which it calls) ignores fillfactor, so that
> for a table without statistics it ends up with a reltuples estimate much
> higher than reality. For example, with fillfactor=10 the estimate is
> about 10x higher.
> 
> I ran into this while doing some tests with hash indexes, where I use
> fillfactor to make the table look bigger (as if the tuples were wider),
> and I ran into this:
> 
>   drop table hash_test;
>   create table hash_test (a int, b text) with (fillfactor=10);
>   insert into hash_test select 1 + ((i - 1) / 10000), md5(i::text)
>          from generate_series(1, 1000000) s(i);
>   -- analyze hash_test;
>   create index hash_test_idx on hash_test using hash (a);
> 
>   select pg_size_pretty(pg_relation_size('hash_test_idx'));
> 
> If you run it like this (without the analyze), the index will be 339MB.
> With the analyze, it's 47MB.
> 
> This only happens if there are no relpages/reltuples statistics yet, in
> which case table_block_relation_estimate_size estimates density from
> tuple width etc.
> 
> So it seems the easiest "fix" is to do ANALYZE before creating the index
> (and not after it, as I had in my scripts). But I wonder how many people
> fail to realize this - it sure took me a while to notice the indexes
> were too large, and even longer to figure out what was causing it. I wouldn't be very
> surprised if many people had bloated hash indexes after bulk loads.
> 
> So maybe we should make table_block_relation_estimate_size smarter to
> also consider the fillfactor in the "no statistics" branch, per the
> attached patch.

Seems like a good idea - I can't think of a reason why we shouldn't do so.

Greetings,

Andres Freund



Re: Should heapam_estimate_rel_size consider fillfactor?

From: Peter Eisentraut

On 14.06.23 20:41, Corey Huinker wrote:
>     So maybe we should make table_block_relation_estimate_size smarter to
>     also consider the fillfactor in the "no statistics" branch, per the
>     attached patch.
> 
> 
> I like this a lot. The reasoning is obvious, the fix is simple, it
> doesn't upset any make-check-world tests, and in order to get a 
> performance regression we'd need a table whose fillfactor has been 
> changed after the data was loaded but before an analyze happens, and 
> that's a narrow enough case to accept.
> 
> My only nitpick is to swap
> 
>     (usable_bytes_per_page * fillfactor / 100) / tuple_width
> 
> with
> 
>     (usable_bytes_per_page * fillfactor) / (tuple_width * 100)
> 
> 
> as this will eliminate the extra remainder truncation, and it also gets 
> the arguments "in order" algebraically.

The fillfactor is in percent, so it makes sense to divide it by 100 
first before doing any further arithmetic with it.  I find your version 
of this to be more puzzling without any explanation.  You could do 
fillfactor/100.0 to avoid the integer division, but then, the comment 
above that says "integer division is intentional here", without any 
further explanation.  I think a bit more explanation of all the 
subtleties here would be good in any case.




Re: Should heapam_estimate_rel_size consider fillfactor?

From: Tomas Vondra

On 7/3/23 08:46, Peter Eisentraut wrote:
> On 14.06.23 20:41, Corey Huinker wrote:
>>     So maybe we should make table_block_relation_estimate_size smarter to
>>     also consider the fillfactor in the "no statistics" branch, per the
>>     attached patch.
>>
>>
>> I like this a lot. The reasoning is obvious, the fix is simple, it
>> doesn't upset any make-check-world tests, and in order to get a
>> performance regression we'd need a table whose fillfactor has been
>> changed after the data was loaded but before an analyze happens, and
>> that's a narrow enough case to accept.
>>
>> My only nitpick is to swap
>>
>>     (usable_bytes_per_page * fillfactor / 100) / tuple_width
>>
>> with
>>
>>     (usable_bytes_per_page * fillfactor) / (tuple_width * 100)
>>
>>
>> as this will eliminate the extra remainder truncation, and it also
>> gets the arguments "in order" algebraically.
> 
> The fillfactor is in percent, so it makes sense to divide it by 100
> first before doing any further arithmetic with it.  I find your version
> of this to be more puzzling without any explanation.  You could do
> fillfactor/100.0 to avoid the integer division, but then, the comment
> above that says "integer division is intentional here", without any
> further explanation.  I think a bit more explanation of all the
> subtleties here would be good in any case.
> 

Yeah, I guess the formula should be doing (fillfactor / 100.0).

The "integer division is intentional here" comment is unrelated to this
change, it refers to the division by "tuple_width" (and yeah, it'd be
nice to have it explain why it's intentional).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Should heapam_estimate_rel_size consider fillfactor?

From: Tomas Vondra

On 7/3/23 09:00, Tomas Vondra wrote:
> 
> 
> On 7/3/23 08:46, Peter Eisentraut wrote:
>> On 14.06.23 20:41, Corey Huinker wrote:
>>>     So maybe we should make table_block_relation_estimate_size smarter to
>>>     also consider the fillfactor in the "no statistics" branch, per the
>>>     attached patch.
>>>
>>>
>>> I like this a lot. The reasoning is obvious, the fix is simple, it
>>> doesn't upset any make-check-world tests, and in order to get a
>>> performance regression we'd need a table whose fillfactor has been
>>> changed after the data was loaded but before an analyze happens, and
>>> that's a narrow enough case to accept.
>>>
>>> My only nitpick is to swap
>>>
>>>     (usable_bytes_per_page * fillfactor / 100) / tuple_width
>>>
>>> with
>>>
>>>     (usable_bytes_per_page * fillfactor) / (tuple_width * 100)
>>>
>>>
>>> as this will eliminate the extra remainder truncation, and it also
>>> gets the arguments "in order" algebraically.
>>
>> The fillfactor is in percent, so it makes sense to divide it by 100
>> first before doing any further arithmetic with it.  I find your version
>> of this to be more puzzling without any explanation.  You could do
>> fillfactor/100.0 to avoid the integer division, but then, the comment
>> above that says "integer division is intentional here", without any
>> further explanation.  I think a bit more explanation of all the
>> subtleties here would be good in any case.
>>
> 
> Yeah, I guess the formula should be doing (fillfactor / 100.0).
> 
> The "integer division is intentional here" comment is unrelated to this
> change, it refers to the division by "tuple_width" (and yeah, it'd be
> nice to have it explain why it's intentional).
> 

FWIW the reason why the integer division is intentional is most likely
that we want "floor" semantics - if there's 10.23 rows per page, that
really means 10 rows per page.

I doubt it makes a huge difference in this particular place, considering
we're calculating the estimate from somewhat unreliable values and then
using it only for a rough estimate of the relation size.

But from this POV, I think it's more correct to do it "my" way:

  density = (usable_bytes_per_page * fillfactor / 100) / tuple_width;

because that's doing *two* separate integer divisions, with floor
semantics. First we calculate "usable bytes" (rounded down), then
average number of rows per page (also rounded down).

Corey's formula would do just one integer division. I don't think it
makes a huge difference, though. I mean, it's just an estimate and so we
can't expect to be 100% accurate.
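
For example, with usable_bytes_per_page = 8168, fillfactor = 70 and
tuple_width = 571 (made-up numbers, just to show the two rounding steps),
the first division gives 8168 * 70 / 100 = 5717 usable bytes (5717.6
rounded down), and the second gives 5717 / 571 = 10 rows per page (10.01
rounded down).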

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Should heapam_estimate_rel_size consider fillfactor?

From: Tomas Vondra

On 7/3/23 11:40, Tomas Vondra wrote:
> ...
>
> FWIW the reason why the integer division is intentional is most likely
> that we want "floor" semantics - if there's 10.23 rows per page, that
> really means 10 rows per page.
> 
> I doubt it makes a huge difference in this particular place, considering
> we're calculating the estimate from somewhat unreliable values and then
> using it only for a rough estimate of the relation size.
> 
> But from this POV, I think it's more correct to do it "my" way:
> 
>   density = (usable_bytes_per_page * fillfactor / 100) / tuple_width;
> 
> because that's doing *two* separate integer divisions, with floor
> semantics. First we calculate "usable bytes" (rounded down), then
> average number of rows per page (also rounded down).
> 
> Corey's formula would do just one integer division. I don't think it
> makes a huge difference, though. I mean, it's just an estimate and so we
> can't expect to be 100% accurate.
> 

Pushed, using the formula with two divisions (as in the original patch).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company