Thread: [Proposal] Extend TableAM routines for ANALYZE scan

[Proposal] Extend TableAM routines for ANALYZE scan

From
Pengzhou Tang
Date:
While hacking on Zedstore, we needed more accurate statistics for it and ran into
some restrictions:
1) acquire_sample_rows() always uses RelationGetNumberOfBlocks to generate sampling block
    numbers. This is not friendly to zedstore, which wants to use logical block numbers, and might
    also not be friendly to non-block-oriented table AMs.
2) The columns of a zedstore table are stored separately, so the columns in a row have different
    physical positions. The tid in a tuple is invalid for zedstore, which means the correlation
    statistic is incorrect for zedstore.
3) RelOptInfo->pages is not correct for zedstore if we only access part of the columns, which makes
    the estimated IO cost much higher than the actual cost.

For 1) and 2), we propose to extend the existing ANALYZE-scan table AM routines in patch
"0001-ANALYZE-tableam-API-change.patch", which adds three more APIs:
scan_analyze_beginscan(), scan_analyze_sample_tuple(), and scan_analyze_endscan(). This is more
convenient, and table AMs can take more control of every step of sampling rows. Meanwhile,
with the new structure named "AcquireSampleContext", we can acquire extra info (e.g. physical position,
physical size) besides the real column values.
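
As a rough illustration of how the three callbacks could fit together, here is a toy, standalone sketch. Only the three callback names and the name "AcquireSampleContext" come from the proposal; the struct fields, the signatures, the AnalyzeScanRoutine table, the toy AM, and the acquire_sample() driver are all assumptions made for this example and are not the definitions in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed AcquireSampleContext: besides the
 * sampled value it carries extra per-scan info (position, physical size). */
typedef struct AcquireSampleContext
{
    const int  *rows;       /* toy "table": one int value per row */
    size_t      nrows;      /* number of rows in the toy table */
    size_t      next;       /* next logical position to sample */
    size_t      step;       /* sample every step-th row */
    long        position;   /* extra info: position of the last sampled row */
    size_t      disksize;   /* extra info: bytes accounted to sampled rows */
} AcquireSampleContext;

/* Hypothetical callback table; only the three member names are from the proposal. */
typedef struct AnalyzeScanRoutine
{
    void (*scan_analyze_beginscan) (AcquireSampleContext *ctx);
    int  (*scan_analyze_sample_tuple) (AcquireSampleContext *ctx, int *value);
    void (*scan_analyze_endscan) (AcquireSampleContext *ctx);
} AnalyzeScanRoutine;

/* Toy AM: the AM, not the caller, decides which logical positions to visit. */
static void
toy_beginscan(AcquireSampleContext *ctx)
{
    ctx->next = 0;
    ctx->position = -1;
    ctx->disksize = 0;
}

static int
toy_sample_tuple(AcquireSampleContext *ctx, int *value)
{
    if (ctx->next >= ctx->nrows)
        return 0;                       /* no more rows to sample */
    *value = ctx->rows[ctx->next];
    ctx->position = (long) ctx->next;   /* report extra info to the caller */
    ctx->disksize += sizeof(int);
    ctx->next += ctx->step;
    return 1;
}

static void
toy_endscan(AcquireSampleContext *ctx)
{
    ctx->next = ctx->nrows;             /* mark the scan as finished */
}

const AnalyzeScanRoutine toy_am = {toy_beginscan, toy_sample_tuple, toy_endscan};

/* Hypothetical generic driver: it never computes block numbers itself. */
size_t
acquire_sample(const AnalyzeScanRoutine *am, AcquireSampleContext *ctx,
               int *out, size_t max)
{
    size_t  n = 0;
    int     v;

    assert(ctx->step > 0);
    am->scan_analyze_beginscan(ctx);
    while (n < max && am->scan_analyze_sample_tuple(ctx, &v))
        out[n++] = v;
    am->scan_analyze_endscan(ctx);
    return n;
}
```

The point of the sketch is only the division of labour: the driver loops over whatever the AM hands back, so a non-block-oriented AM is free to pick its own sampling positions and to report extra info through the context.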

For 3), we hope to have a mechanism similar to RelOptInfo->rows, which is calculated from
(RelOptInfo->tuples * Selectivity): we can calculate RelOptInfo->pages with a page selectivity that
is based on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
shows one idea: add `stadiskfrac` to pg_statistic and have the planner use it to estimate
RelOptInfo->pages.
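
To make the page-selectivity idea concrete, here is a minimal standalone sketch. It assumes the page selectivity is simply the sum of the per-column disk fractions of the selected columns; that formula, the function name estimate_pages(), and its parameters are assumptions for illustration, and the actual patch may compute this differently.

```c
#include <stddef.h>

/*
 * Hypothetical helper: scale the relation's total page count by a
 * "page selectivity", by analogy with rows = tuples * selectivity.
 *
 * stadiskfrac[i] is the fraction of on-disk size attributed to column i;
 * selected[i] is nonzero if the query reads column i.
 */
double
estimate_pages(double total_pages, const double *stadiskfrac,
               const int *selected, size_t ncols)
{
    double  page_selectivity = 0.0;
    size_t  i;

    for (i = 0; i < ncols; i++)
    {
        if (selected[i])
            page_selectivity += stadiskfrac[i];
    }

    /* Clamp so that bad statistics cannot push the estimate out of range. */
    if (page_selectivity < 0.0)
        page_selectivity = 0.0;
    if (page_selectivity > 1.0)
        page_selectivity = 1.0;

    return total_pages * page_selectivity;
}
```

For example, with fractions {0.5, 0.25, 0.25} and a 1000-page table, a query touching only the first and third columns would be costed at 750 pages instead of 1000.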

0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how Zedstore uses the
previous patches to:
1. use logical block ids to acquire the sample rows.
2. acquire sample rows from only the specified column c1; this is used when the user analyzes a
    table on specified columns only, e.g. "analyze zs (c1)".
3. during ANALYZE, have the zedstore table AM provide extra disksize info, from which ANALYZE
    computes the physical fraction statistic of each column; the planner then uses it to estimate
    the IO cost based on the selected columns.
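
As a sketch of item 3, the per-column physical fraction could be derived from the disksize info roughly as follows. The function name compute_disk_fractions(), its inputs, and the plain size/total ratio are assumptions for illustration only, not what the patch necessarily does.

```c
#include <stddef.h>

/*
 * Hypothetical helper: turn per-column on-disk sizes reported by the table AM
 * during ANALYZE into per-column fractions of the total size.
 */
void
compute_disk_fractions(const size_t *col_disksize, size_t ncols,
                       double *frac_out)
{
    double  total = 0.0;
    size_t  i;

    for (i = 0; i < ncols; i++)
        total += (double) col_disksize[i];

    for (i = 0; i < ncols; i++)
    {
        /* An empty sample gives no information; report zero fractions. */
        frac_out[i] = (total > 0.0) ? (double) col_disksize[i] / total : 0.0;
    }
}
```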

Thanks,
Pengzhou

Re: [Proposal] Extend TableAM routines for ANALYZE scan

From
Julien Rouhaud
Date:
Hello,

On Thu, Dec 5, 2019 at 11:14 AM Pengzhou Tang <ptang@pivotal.io> wrote:
>
> While hacking on Zedstore, we needed more accurate statistics for it and ran into
> some restrictions:
> 1) acquire_sample_rows() always uses RelationGetNumberOfBlocks to generate sampling block
>     numbers. This is not friendly to zedstore, which wants to use logical block numbers, and might
>     also not be friendly to non-block-oriented table AMs.
> 2) The columns of a zedstore table are stored separately, so the columns in a row have different
>     physical positions. The tid in a tuple is invalid for zedstore, which means the correlation
>     statistic is incorrect for zedstore.
> 3) RelOptInfo->pages is not correct for zedstore if we only access part of the columns, which makes
>     the estimated IO cost much higher than the actual cost.
>
> For 1) and 2), we propose to extend the existing ANALYZE-scan table AM routines in patch
> "0001-ANALYZE-tableam-API-change.patch", which adds three more APIs:
> scan_analyze_beginscan(), scan_analyze_sample_tuple(), and scan_analyze_endscan(). This is more
> convenient, and table AMs can take more control of every step of sampling rows. Meanwhile,
> with the new structure named "AcquireSampleContext", we can acquire extra info (e.g. physical position,
> physical size) besides the real column values.
>
> For 3), we hope to have a mechanism similar to RelOptInfo->rows, which is calculated from
> (RelOptInfo->tuples * Selectivity): we can calculate RelOptInfo->pages with a page selectivity that
> is based on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
> shows one idea: add `stadiskfrac` to pg_statistic and have the planner use it to estimate
> RelOptInfo->pages.
>
> 0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how Zedstore uses the
> previous patches to:
> 1. use logical block ids to acquire the sample rows.
> 2. acquire sample rows from only the specified column c1; this is used when the user analyzes a
>     table on specified columns only, e.g. "analyze zs (c1)".
> 3. during ANALYZE, have the zedstore table AM provide extra disksize info, from which ANALYZE
>     computes the physical fraction statistic of each column; the planner then uses it to estimate
>     the IO cost based on the selected columns.

I couldn't find an entry for that patchset in the next commitfest.
Could you register it so that it won't be forgotten?