Re: Inefficient queryplan for query with intersectable

From: Arjen van der Meijden
Subject: Re: Inefficient queryplan for query with intersectable
Date: ,
Msg-id: 4311CDD5.8060508@tweakers.net
(view: Whole thread, Raw)
In response to: Re: Inefficient queryplan for query with intersectable  (Tom Lane)
List: pgsql-performance

Tree view

Inefficient queryplan for query with intersectable subselects/joins  (Arjen van der Meijden, )
 Re: Inefficient queryplan for query with intersectable  (Richard Huxton, )
  Re: Inefficient queryplan for query with intersectable  (Arjen van der Meijden, )
   Re: Inefficient queryplan for query with intersectable  (Tom Lane, )
    Re: Inefficient queryplan for query with intersectable  (Arjen van der Meijden, )
     Re: Inefficient queryplan for query with intersectable  (Tom Lane, )
      Re: Inefficient queryplan for query with  (Ron, )
      Re: Inefficient queryplan for query with intersectable  (Arjen van der Meijden, )

On 27-8-2005 16:27, Tom Lane wrote:
> Arjen van der Meijden <> writes:
>
>>Is a nested loop normally so much (3x) more costly than a hash join? Or
>>is it just this query that gets estimated wronly?
>
> There's been some discussion that we are overestimating the cost of
> nestloops in general, because we don't take into account that successive
> scans of the inner relation are likely to find many pages already in
> cache from the earlier scans.  So far no one's come up with a good cost
> model to use for this, though.

Ah, that explains. I take it, there already is an estimation for the
cost of "the amount of pages that will be loaded for this operation".
For indexed lookups this will probably be something like "the amount of
expected pages to fetch * the random page cost"?

And appareantly for the nested loop its something like "iterations *
amount of pages per iteration * random page cost" ?
The naive approach seems to me, is to just calculate the probable amount
of pages to fetch from disk rather than from cache.

In this case there are 7692207 rows in 60569 pages and on average 234
rows per product (per nested loop) in the estimation. It estimates that
it'll have to do 565 iterations.
In worst case for the first 234 rows, no pages are already cached and
the rows are all in a seperate page. So thats 234 pages to fetch.
In the second iteration, you know already 234 pages are fetched and
that's about 0.386% of the total pages. So the expected amount of pages
for the next 234 pages expected to be in cache is 234 * 0.00386 = 1.
After that you'll have 234 + 233 pages in cache, etc, etc.
Following that approach, the 565th iteration only has to pull in about
27 new pages in the worst case of all records being perfectly scattered
over the pages, not 234.

Of course this has to be adjusted for the amount of available buffers
and cache and the expected amount of pages to fetch for the iterations,
which may be less than 234.

When a fetch of a random page costs 4 and one from cache 0.01, there is
quite a large difference: 565 * (234 * 4) = 530535 vs 215864,93

Actually the likeliness of a page being in cache is a bit higher, since
the expectation increases for each newly fetched page, not for batches
of 234. I didn't use that in my calculation here.

Anyway, this is probably been thought over already and there may be many
flaws in it. If not, please think it over.

Best regards,

Arjen


pgsql-performance by date:

From: "Ilia Kantor"
Date:
Subject: intarray is broken ? (8.1b1)
From: "Ilia Kantor"
Date:
Subject: Bitmap scan when it is not needed