Re: bitmaps and correlation - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: bitmaps and correlation
Date
Msg-id 20200107052606.GP12066@telsasoft.com
Whole thread Raw
In response to Re: bitmaps and correlation  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: bitmaps and correlation  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On Tue, Jan 07, 2020 at 09:21:03AM +0530, Dilip Kumar wrote:
> On Tue, Jan 7, 2020 at 1:29 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > Find attached cleaned up patch.
> > For now, I updated the regress/expected/, but I think the test maybe has to be
> > updated to do what it was written to do.
> 
> I have noticed that in "cost_index" we have used the indexCorrelation
> for computing the run_cost, not the number of pages whereas in your
> patch you have used it for computing the number of pages.  Any reason
> for the same?

As Jeff has pointed out, high correlation has two effects in cost_index():
1) the number of pages read will be less;
2) the pages will be read more sequentially;

cost_index reuses the pages_fetched variable, so (1) isn't particularly clear,
but does essentially:

                /* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
                pages_fetched(MIN) = index_pages_fetched(tuples_fetched,
                                                    baserel->pages,
                                                    (double) index->pages,
                                                    root);
                max_IO_cost = pages_fetchedMIN * spc_random_page_cost;

                /* min_IO_cost is for the perfectly correlated case (csquared=1) */
                pages_fetched(MAX) = ceil(indexSelectivity * (double) baserel->pages);
                min_IO_cost = (pages_fetchedMAX - 1) * spc_seq_page_cost;


My patch 1) changes compute_bitmap_pages() to interpolate pages_fetched using
the correlation; pages_fetchedMIN is new:

> Patch
> - pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
> + pages_fetchedMAX = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
> +
> + /* pages_fetchedMIN is for the perfectly correlated case (csquared=1) */
> + pages_fetchedMIN = ceil(indexSelectivity * (double) baserel->pages);
> +
> + pages_fetched = pages_fetchedMAX + indexCorrelation*indexCorrelation*(pages_fetchedMIN - pages_fetchedMAX);

And, 2) also computes cost_per_page by interpolation between seq_page and
random_page cost:

+               cost_per_page_corr = spc_random_page_cost -
+                       (spc_random_page_cost - spc_seq_page_cost)
+                       * (1-correlation*correlation);

Thanks for looking.  I'll update the name of pages_fetchedMIN/MAX in my patch
for consistency with cost_index.

Justin



pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: adding partitioned tables to publications
Next
From: Nagaraj Raj
Date:
Subject: pg_repack failure