Home > mailing lists

Re: Improved Cost Calculation for IndexOnlyScan - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Improved Cost Calculation for IndexOnlyScan
Date	September 29, 2020 08:08:46
Msg-id	bfa84ba5-bc7c-3794-e8eb-34897feac610@iki.fi Whole thread Raw
In response to	Improved Cost Calculation for IndexOnlyScan (Hamid Akhtar <hamid.akhtar@gmail.com>)
Responses	Re: Improved Cost Calculation for IndexOnlyScan
List	pgsql-hackers

Tree view

On 29/09/2020 10:06, Hamid Akhtar wrote:
> In one of my earlier emails [1], I mentioned that there seems to be a 
> problem with how the cost for index only scans is being calculated.
> [1] 
> https://www.postgresql.org/message-id/CANugjhsnh0OBMOYc7qKcC%2BZsVvAXCeF7QiidLuFvg6zmHy1C7A%40mail.gmail.com
> 
> My concern is that there seems to be a bigger disconnect between the 
> cost of index only scan and the execution time. Having tested this on 3 
> different systems, docker, laptop and a server with RAID 5 SSD 
> configured, at the threshold where index only scan cost exceeds that of 
> sequential scan, index only is still around 30% faster than the 
> sequential scan.

A 30% discrepancy doesn't sound too bad, to be honest. The exact 
threshold depends on so many factors.

> My initial hunch was that perhaps we need to consider a different 
> approach when considering cost for index only scan. However, the 
> solution seems somewhat simple.
> 
> cost_index function in costsize.c, in case of indexonlyscan, multiplies 
> the number of pages fetched by a factor of (1.0 - baserel->allvisfrac) 
> which is then used to calculate the max_IO_cost and min_IO_cost.
> 
> This is very similar to the cost estimate methods for indexes internally 
> call genericostesimate function. This function primarily gets the number 
> of pages for the indexes and multiplies that with random page cost 
> (spc_random_page_cost) to get the total disk access cost.
> 
> I believe that in case of index only scan, we should adjust the 
> spc_random_page_cost in context of baserel->allvisfrac so that it 
> accounts for random pages for only the fraction that needs to be read 
> for the relation and excludes that the index page fetches.

That doesn't sound right to me. The genericcostestimate() function 
calculates the number of *index* pages fetched. It makes no difference 
if it's an Index Scan or an Index Only Scan.

genericcostestimate() could surely be made smarter. Currently, it 
multiplies the number of index pages fetched with random_page_cost, even 
though a freshly created index is mostly physically ordered by the keys. 
seq_page_cost with some fudge factor might be more appropriate, whether 
or not it's an Index Only Scan. Not sure what the exact formula should 
be, just replacing random_page_cost with seq_page_cost is surely not 
right either.

- Heikki

pgsql-hackers by date:

From: Daniel Gustafsson
Date: 29 September 2020, 07:52:17
Subject: Re: Support for NSS as a libpq TLS backend

From: Thomas Munro
Date: 29 September 2020, 08:12:13
Subject: Re: Assertion failure with barriers in parallel hash join

Re: Improved Cost Calculation for IndexOnlyScan - Mailing list pgsql-hackers

Previous

Next