Thread: [Proposal] Extend TableAM routines for ANALYZE scan

[Proposal] Extend TableAM routines for ANALYZE scan

From: Pengzhou Tang
Date: 05 December 2019, 10:14:17

While hacking on Zedstore, we needed more accurate statistics for zedstore tables and ran into some restrictions:

1) acquire_sample_rows() always uses RelationGetNumberOfBlocks to generate the block numbers to sample. This is not friendly to zedstore, which wants to use logical block numbers, and it might also be unfriendly to non-block-oriented table AMs.

2) The columns of a zedstore table are stored separately, so the columns of a row have different physical positions. The tid in a tuple is invalid for zedstore, which means the correlation statistic is incorrect for zedstore.

3) RelOptInfo->pages is not correct for Zedstore if we only access a subset of the columns, which makes the estimated IO cost much higher than the actual cost.

For 1) and 2), we propose extending the existing ANALYZE-scan table AM routines in patch "0001-ANALYZE-tableam-API-change.patch", which adds three more APIs: scan_analyze_beginscan(), scan_analyze_sample_tuple(), and scan_analyze_endscan(). This is more convenient and lets table AMs take more control of every step of sampling rows. Meanwhile, with the new structure named "AcquireSampleContext", we can acquire extra info (e.g. physical position, physical size) in addition to the real column values.
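To make the shape of the three routines concrete, here is a minimal, self-contained C sketch. It is not the code from the patch: the struct layouts, field names, the toy AM, and the driver function acquire_sample_rows_sketch() are all hypothetical, and the toy AM walks logical blocks with a fixed stride rather than doing real random sampling.

```c
#include <assert.h>
#include <stddef.h>

#define SAMPLE_MAX 16

/* Hypothetical stand-in for AcquireSampleContext: besides the sampled
 * values, it carries extra per-row info reported by the table AM. */
typedef struct AcquireSampleContext
{
    int   targrows;               /* number of rows the caller wants */
    int   numrows;                /* rows collected so far */
    long  rows[SAMPLE_MAX];       /* sampled values (one int column here) */
    long  position[SAMPLE_MAX];   /* AM-reported position of each row */
    void *am_private;             /* AM-specific scan state */
} AcquireSampleContext;

/* Hypothetical callback table named after the three proposed routines. */
typedef struct AnalyzeScanRoutine
{
    void (*scan_analyze_beginscan)(AcquireSampleContext *ctx);
    int  (*scan_analyze_sample_tuple)(AcquireSampleContext *ctx);
    void (*scan_analyze_endscan)(AcquireSampleContext *ctx);
} AnalyzeScanRoutine;

/* Toy AM state: the AM, not the caller, decides what a "block" is. */
typedef struct ToyScanState
{
    long next_block;   /* next logical block to visit */
    long nblocks;      /* logical block count chosen by the AM */
    long stride;       /* distance between sampled logical blocks */
} ToyScanState;

static ToyScanState toy_state;

static void toy_beginscan(AcquireSampleContext *ctx)
{
    toy_state.next_block = 0;
    toy_state.nblocks = 100;
    toy_state.stride = toy_state.nblocks / ctx->targrows;
    if (toy_state.stride < 1)
        toy_state.stride = 1;
    ctx->numrows = 0;
    ctx->am_private = &toy_state;
}

/* Returns 1 if a row was added to the sample, 0 when the scan is done. */
static int toy_sample_tuple(AcquireSampleContext *ctx)
{
    ToyScanState *st = ctx->am_private;

    if (ctx->numrows >= ctx->targrows || st->next_block >= st->nblocks)
        return 0;
    ctx->rows[ctx->numrows] = st->next_block * 10;   /* fake column value */
    ctx->position[ctx->numrows] = st->next_block;    /* extra info from AM */
    ctx->numrows++;
    st->next_block += st->stride;
    return 1;
}

static void toy_endscan(AcquireSampleContext *ctx)
{
    ctx->am_private = NULL;
}

const AnalyzeScanRoutine toy_am = {
    toy_beginscan, toy_sample_tuple, toy_endscan
};

/* Generic driver: it never computes block numbers itself. */
int acquire_sample_rows_sketch(const AnalyzeScanRoutine *am,
                               AcquireSampleContext *ctx, int targrows)
{
    assert(targrows > 0 && targrows <= SAMPLE_MAX);
    ctx->targrows = targrows;
    am->scan_analyze_beginscan(ctx);
    while (am->scan_analyze_sample_tuple(ctx))
        ;
    am->scan_analyze_endscan(ctx);
    return ctx->numrows;
}
```

The point of the sketch is only the division of labor: the caller drives begin/sample/end, while the AM owns block selection and fills in the extra per-row info.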

For 3), we hope to have a mechanism similar to RelOptInfo->rows, which is calculated from (RelOptInfo->tuples * Selectivity): we can calculate RelOptInfo->pages with a page selectivity that is based on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch shows one idea: add `stadiskfrac` to pg_statistic and have the planner use it to estimate RelOptInfo->pages.
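As a rough illustration of the page-selectivity idea, the following C sketch scales a table's total page count by the summed disk fractions of the columns a query touches. The exact formula used in the patch is not spelled out here, so the summing rule, the clamp to 1.0, and the function name estimate_pages_sketch() are assumptions made for this example only.

```c
#include <assert.h>

/* Hypothetical helper: estimate the pages read when only some columns
 * are accessed.  stadiskfrac[i] is the fraction of the table's on-disk
 * size taken by column i; selected[i] is nonzero if column i is read. */
double estimate_pages_sketch(double total_pages,
                             const double *stadiskfrac,
                             const int *selected,
                             int ncols)
{
    double page_selectivity = 0.0;
    int    i;

    assert(total_pages >= 0.0 && ncols >= 0);

    for (i = 0; i < ncols; i++)
    {
        if (selected[i])
            page_selectivity += stadiskfrac[i];
    }

    /* Guard against fractions that sum to slightly more than 1. */
    if (page_selectivity > 1.0)
        page_selectivity = 1.0;

    /* Same shape as rows = tuples * selectivity. */
    return total_pages * page_selectivity;
}
```

For example, with a 1000-page table whose three columns have fractions 0.5, 0.25 and 0.25, reading only the first and third columns gives an estimate of 750 pages under these assumptions.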

0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how Zedstore uses the previous patches to achieve the following:

1. Use logical block ids to acquire the sample rows.

2. Acquire sample rows only from the specified column c1; this is used when the user analyzes the table on specified columns only, e.g. "analyze zs (c1)".

3. During ANALYZE, the zedstore table AM provides extra disk size info; ANALYZE then computes the physical fraction statistic of each column, and the planner uses it to estimate the IO cost based on the selected columns.
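For point 3, the physical fraction of each column could be derived from the disk sizes the AM reports during sampling, roughly as in the C sketch below. This is an illustration under assumptions, not the code in the patch: the function name compute_disk_fractions_sketch() and the idea of dividing each column's sampled size by the sum over all columns are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: turn per-column sampled disk sizes (in bytes)
 * into per-column fractions of the total sampled size. */
void compute_disk_fractions_sketch(const size_t *disksize,
                                   int ncols,
                                   double *frac_out)
{
    size_t total = 0;
    int    i;

    assert(ncols > 0);

    for (i = 0; i < ncols; i++)
        total += disksize[i];

    for (i = 0; i < ncols; i++)
    {
        /* An empty sample yields zero fractions instead of dividing by 0. */
        frac_out[i] = (total > 0)
            ? (double) disksize[i] / (double) total
            : 0.0;
    }
}
```

With sampled sizes of 400, 100 and 300 bytes, this yields fractions 0.5, 0.125 and 0.375.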

Thanks,

Pengzhou

Re: [Proposal] Extend TableAM routines for ANALYZE scan

From: Julien Rouhaud
Date: 23 December 2019, 12:51:28

Hello,

On Thu, Dec 5, 2019 at 11:14 AM Pengzhou Tang <ptang@pivotal.io> wrote:
>
> When hacking the Zedstore, we need to get a more accurate statistic for zedstore and we
> faced some restrictions:
> 1) acquire_sample_rows() always use RelationGetNumberOfBlocks to generate sampling block
>     numbers, this is not friendly for zedstore which wants to use a logical block number and might also
>     not friendly to non-block-oriented Table AMs.
> 2) columns of zedstore table store separately, so columns in a row have a different physical position,
>     tid in a tuple is invalid for zedstore which means the correlation statistic is incorrect for zedstore.
> 3) RelOptInfo->pages is not correct for Zedstore if we only access partial of the columns which make
>    the IO cost much higher than the actual cost.
>
> For 1) and 2), we propose to extend existing ANALYZE-scan table AM routines in patch
> "0001-ANALYZE-tableam-API-change.patch" which add three more APIs:
> scan_analyze_beginscan(), scan_analyze_sample_tuple(), scan_analyze_endscan(). This provides
> more convenience and table AMs can take more control of every step of sampling rows. Meanwhile,
> with the new structure named "AcquireSampleContext", we can acquire extra info (eg: physical position,
> physical size) except the real columns values.
>
> For 3), we hope we can have a similar mechanism with RelOptInfo->rows which is calculated from
>  (RelOptInfo->tuples * Selectivity), we can calculate RelOptInfo->pages with a page selectivity which
> is base on the selected zedstore columns.  0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
> shows one idea that adding the `stadiskfrac` to pg_statistic and planner use it to estimate the
> RelOptInfo->pages.
>
> 0003-ZedStore-use-extended-ANAlYZE-API.patch is attached to only show how Zedstore use the
> previous patches to achieve:
> 1. use logical block id to acquire the sample rows.
> 2. can only acquire sample rows from specified column c1, this is used when user only analyze table
>     on specified columns eg: "analyze zs (c1)".
> 3 when ANALYZE, zedstore table AM provided extra disksize info, then ANALYZE compute the
>     physical fraction statistic of each column and planner use it to estimate the IO cost based on
>     the selected columns.

I couldn't find an entry for that patchset in the next commitfest.
Could you register it so that it won't be forgotten?