Re: Should heapam_estimate_rel_size consider fillfactor? - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Should heapam_estimate_rel_size consider fillfactor?
Date
Msg-id 60eaa69b-1c57-67ed-3aee-7600db134e42@enterprisedb.com
In response to Re: Should heapam_estimate_rel_size consider fillfactor?  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Should heapam_estimate_rel_size consider fillfactor?
List pgsql-hackers

On 7/3/23 09:00, Tomas Vondra wrote:
> 
> 
> On 7/3/23 08:46, Peter Eisentraut wrote:
>> On 14.06.23 20:41, Corey Huinker wrote:
>>>     So maybe we should make table_block_relation_estimate_size smarter to
>>>     also consider the fillfactor in the "no statistics" branch, per the
>>>     attached patch.
>>>
>>>
>>> I like this a lot. The reasoning is obvious, the fix is simple, it
>>> doesn't upset any make-check-world tests, and in order to get a
>>> performance regression we'd need a table whose fillfactor has been
>>> changed after the data was loaded but before an analyze happens, and
>>> that's a narrow enough case to accept.
>>>
>>> My only nitpick is to swap
>>>
>>>     (usable_bytes_per_page * fillfactor / 100) / tuple_width
>>>
>>> with
>>>
>>>     (usable_bytes_per_page * fillfactor) / (tuple_width * 100)
>>>
>>>
>>> as this will eliminate the extra remainder truncation, and it also
>>> gets the arguments "in order" algebraically.
>>
>> The fillfactor is in percent, so it makes sense to divide it by 100
>> first before doing any further arithmetic with it.  I find your version
>> of this to be more puzzling without any explanation.  You could do
>> fillfactor/100.0 to avoid the integer division, but then, the comment
>> above that says "integer division is intentional here", without any
>> further explanation.  I think a bit more explanation of all the
>> subtleties here would be good in any case.
>>
> 
> Yeah, I guess the formula should be doing (fillfactor / 100.0).
> 
> The "integer division is intentional here" comment is unrelated to this
> change, it refers to the division by "tuple_width" (and yeah, it'd be
> nice to have it explain why it's intentional).
> 

FWIW the reason why the integer division is intentional is most likely
that we want "floor" semantics: if there are 10.23 rows per page, that
really means 10 rows per page.
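
For example (numbers made up just for illustration, not taken from the
actual code):

  /* say 8168 usable bytes and 798-byte tuples, i.e. ~10.23 rows/page */
  int     usable_bytes_per_page = 8168;
  int     tuple_width = 798;
  int     density = usable_bytes_per_page / tuple_width;  /* floors to 10 */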

I doubt it makes a huge difference in this particular place, considering
we're calculating the estimate from somewhat unreliable values and then
using it for a rough estimate of the relation size.

But from this POV, I think it's more correct to do it "my" way:

  density = (usable_bytes_per_page * fillfactor / 100) / tuple_width;

because that's doing *two* separate integer divisions, with floor
semantics. First we calculate "usable bytes" (rounded down), then
average number of rows per page (also rounded down).

Corey's formula would do just one integer division. I don't think it
makes a huge difference, though. I mean, it's just an estimate, so we
can't expect it to be 100% accurate.
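
Just to illustrate, a toy comparison (again with made-up numbers):

  int     usable_bytes_per_page = 8168;
  int     fillfactor = 70;    /* percent */
  int     tuple_width = 28;

  /* two integer divisions, flooring twice: 5717 / 28 = 204 */
  int     density_a = (usable_bytes_per_page * fillfactor / 100) / tuple_width;

  /* one integer division, flooring once: 571760 / 2800 = 204 */
  int     density_b = (usable_bytes_per_page * fillfactor) / (tuple_width * 100);

Both come out the same here, and for positive values flooring twice
should give the same result as flooring once, so it's mostly a question
of which form is easier to read.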

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


