Thread: [Proposal] Extend TableAM routines for ANALYZE scan
While hacking on Zedstore, we needed more accurate statistics for zedstore tables and
ran into some restrictions:
1) acquire_sample_rows() always uses RelationGetNumberOfBlocks to generate the block numbers
to sample. This does not suit zedstore, which wants to use logical block numbers, and it may
also not suit other non-block-oriented table AMs.
2) Zedstore stores the columns of a table separately, so the columns of a row have different
physical positions. The tid in a tuple is therefore invalid for zedstore, which means the
correlation statistic is incorrect for zedstore.
3) RelOptInfo->pages is not correct for zedstore if a query accesses only some of the columns,
which makes the estimated IO cost much higher than the actual cost.
For 1) and 2), we propose to extend the existing ANALYZE-scan table AM routines in patch
"0001-ANALYZE-tableam-API-change.patch", which adds three more APIs:
scan_analyze_beginscan(), scan_analyze_sample_tuple(), scan_analyze_endscan(). These are
more convenient and give table AMs more control over every step of sampling rows. In addition,
with the new structure named "AcquireSampleContext", we can acquire extra info (eg: physical
position, physical size) besides the actual column values.
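To make the shape of 1) and 2) concrete, here is a rough, self-contained sketch of the
begin/sample/end pattern. It is not the code from the patch: SampleContext,
AnalyzeScanRoutine, the toy_* callbacks and acquire_sample_rows_sketch() are hypothetical
stand-ins, and the real signatures and AcquireSampleContext fields may differ. The point is
only that the AM, not the caller, decides which logical positions to visit and can report
extra per-row info.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SAMPLE 8

/* Hypothetical stand-in for AcquireSampleContext; all fields are assumptions. */
typedef struct SampleContext {
    int    target_rows;            /* rows requested by ANALYZE */
    int    nrows;                  /* rows collected so far */
    long   next_pos;               /* AM-private cursor: next logical position */
    long   positions[MAX_SAMPLE];  /* AM-reported position of each sampled row */
    double disksize[MAX_SAMPLE];   /* AM-reported on-disk size of each sampled row */
} SampleContext;

/* Hypothetical callback table mirroring the three proposed routines. */
typedef struct AnalyzeScanRoutine {
    void (*beginscan)(SampleContext *ctx, int target_rows);
    int  (*sample_tuple)(SampleContext *ctx);   /* returns 0 when no more rows */
    void (*endscan)(SampleContext *ctx);
} AnalyzeScanRoutine;

/* Toy AM: 20 logical rows, samples every 5th one, 16 bytes per row. */
#define TOY_TOTAL_ROWS 20L
#define TOY_STRIDE      5L

static void toy_beginscan(SampleContext *ctx, int target_rows)
{
    assert(target_rows > 0 && target_rows <= MAX_SAMPLE);
    ctx->target_rows = target_rows;
    ctx->nrows = 0;
    ctx->next_pos = 0;
}

static int toy_sample_tuple(SampleContext *ctx)
{
    if (ctx->nrows >= ctx->target_rows || ctx->next_pos >= TOY_TOTAL_ROWS)
        return 0;
    /* The AM picks the logical position itself; no physical block count is needed. */
    ctx->positions[ctx->nrows] = ctx->next_pos;
    ctx->disksize[ctx->nrows] = 16.0;
    ctx->nrows++;
    ctx->next_pos += TOY_STRIDE;
    return 1;
}

static void toy_endscan(SampleContext *ctx)
{
    ctx->next_pos = -1;   /* mark the scan as finished */
}

static const AnalyzeScanRoutine toy_am = {
    toy_beginscan, toy_sample_tuple, toy_endscan
};

/* Caller side: drive the three callbacks without knowing the storage layout. */
static int acquire_sample_rows_sketch(const AnalyzeScanRoutine *am,
                                      SampleContext *ctx, int target_rows)
{
    am->beginscan(ctx, target_rows);
    while (am->sample_tuple(ctx))
        ;
    am->endscan(ctx);
    return ctx->nrows;
}
```

With this split, the caller never computes block numbers; a block-oriented AM could still
sample by block inside its own sample_tuple callback.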
For 3), we would like a mechanism similar to the one for RelOptInfo->rows, which is calculated as
(RelOptInfo->tuples * Selectivity): we can calculate RelOptInfo->pages with a page selectivity that
is based on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
shows one idea: add `stadiskfrac` to pg_statistic and let the planner use it to estimate
RelOptInfo->pages.
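A minimal sketch of the page-selectivity idea, by analogy with rows = tuples * Selectivity.
It assumes that stadiskfrac holds each column's share of the table's on-disk size and that
the planner sums the shares of the referenced columns; the function name and the exact
formula are illustrative assumptions, not taken from the 0002 patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: estimate the pages a scan has to read when it touches
 * only some columns.  stadiskfrac[i] is assumed to be the fraction of the
 * table's disk size used by column i; selected[] lists the referenced columns.
 */
static double estimate_pages_sketch(double relpages,
                                    const double *stadiskfrac,
                                    const int *selected,
                                    size_t nselected)
{
    double page_selectivity = 0.0;
    size_t i;

    assert(relpages >= 0.0);
    for (i = 0; i < nselected; i++)
        page_selectivity += stadiskfrac[selected[i]];

    /* Never estimate more than the whole relation. */
    if (page_selectivity > 1.0)
        page_selectivity = 1.0;

    return relpages * page_selectivity;
}
```

For example, with per-column fractions {0.125, 0.625, 0.25} and a 1000-page table, a query
that references only the first and third columns would be costed at 375 pages instead of 1000.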
0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how zedstore uses the
previous patches to achieve the following:
1. Use logical block ids to acquire the sample rows.
2. Acquire sample rows only from a specified column c1. This is used when the user analyzes the
table on specified columns only, eg: "analyze zs (c1)".
3. During ANALYZE, the zedstore table AM provides extra disksize info, ANALYZE then computes the
physical fraction statistic of each column, and the planner uses it to estimate the IO cost based
on the selected columns.
Thanks,
Pengzhou
Hello,
On Thu, Dec 5, 2019 at 11:14 AM Pengzhou Tang <ptang@pivotal.io> wrote:
>
> When hacking the Zedstore, we need to get a more accurate statistic for zedstore and we
> faced some restrictions:
> 1) acquire_sample_rows() always use RelationGetNumberOfBlocks to generate sampling block
> numbers, this is not friendly for zedstore which wants to use a logical block number and might also
> not friendly to non-block-oriented Table AMs.
> 2) columns of zedstore table store separately, so columns in a row have a different physical position,
> tid in a tuple is invalid for zedstore which means the correlation statistic is incorrect for zedstore.
> 3) RelOptInfo->pages is not correct for Zedstore if we only access partial of the columns which make
> the IO cost much higher than the actual cost.
>
> For 1) and 2), we propose to extend existing ANALYZE-scan table AM routines in patch
> "0001-ANALYZE-tableam-API-change.patch" which add three more APIs:
> scan_analyze_beginscan(), scan_analyze_sample_tuple(), scan_analyze_endscan(). This provides
> more convenience and table AMs can take more control of every step of sampling rows. Meanwhile,
> with the new structure named "AcquireSampleContext", we can acquire extra info (eg: physical position,
> physical size) except the real columns values.
>
> For 3), we hope we can have a similar mechanism with RelOptInfo->rows which is calculated from
> (RelOptInfo->tuples * Selectivity), we can calculate RelOptInfo->pages with a page selectivity which
> is base on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
> shows one idea that adding the `stadiskfrac` to pg_statistic and planner use it to estimate the
> RelOptInfo->pages.
>
> 0003-ZedStore-use-extended-ANAlYZE-API.patch is attached to only show how Zedstore use the
> previous patches to achieve:
> 1. use logical block id to acquire the sample rows.
> 2. can only acquire sample rows from specified column c1, this is used when user only analyze table
> on specified columns eg: "analyze zs (c1)".
> 3 when ANALYZE, zedstore table AM provided extra disksize info, then ANALYZE compute the
> physical fraction statistic of each column and planner use it to estimate the IO cost based on
> the selected columns.
I couldn't find an entry for that patchset in the next commitfest. Could you
register it so that it won't be forgotten?