[Proposal] Extend TableAM routines for ANALYZE scan - Mailing list pgsql-hackers

From Pengzhou Tang
Subject [Proposal] Extend TableAM routines for ANALYZE scan
Date
Msg-id CAG4reARKmeZezUNc8YmSxht9q=FNe6Lw=+f_ui=Bs6a2vpLmHA@mail.gmail.com
Responses Re: [Proposal] Extend TableAM routines for ANALYZE scan  (Julien Rouhaud <rjuju123@gmail.com>)
List pgsql-hackers
While hacking on Zedstore, we needed more accurate statistics for zedstore tables and
ran into some restrictions:
1) acquire_sample_rows() always uses RelationGetNumberOfBlocks() to generate the block
    numbers to sample. This is unfriendly to zedstore, which wants to use logical block numbers,
    and probably also to non-block-oriented table AMs.
2) The columns of a zedstore table are stored separately, so the columns of one row have different
    physical positions. The tid in a tuple is therefore not a valid physical position for zedstore,
    which means the correlation statistic is incorrect for zedstore.
3) RelOptInfo->pages is not correct for Zedstore if we only access a subset of the columns, which
   makes the estimated IO cost much higher than the actual cost.

For 1) and 2), we propose to extend the existing ANALYZE-scan table AM routines in patch
"0001-ANALYZE-tableam-API-change.patch", which adds three more APIs:
scan_analyze_beginscan(), scan_analyze_sample_tuple() and scan_analyze_endscan(). This is
more convenient, and table AMs get more control over every step of sampling rows. In addition,
with the new structure named "AcquireSampleContext", we can acquire extra info (e.g. physical
position, physical size) besides the actual column values.
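
To make the shape of such an API concrete, here is a minimal, self-contained toy sketch of
the begin / sample-tuple / end flow. Only the three callback names and the name
"AcquireSampleContext" come from the proposal; every field, signature and helper below
(ToyAnalyzeRoutine, toy_am, acquire_sample_rows_sketch, the "every step-th row" sampling)
is a hypothetical stand-in and not the code in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NCOLS 2

/* Name taken from the proposal; all fields here are hypothetical. */
typedef struct AcquireSampleContext
{
    const int *rows;                  /* toy table, row-major, TOY_NCOLS ints per row */
    size_t     nrows;                 /* number of rows in the toy table */
    size_t     step;                  /* take every step-th row (stand-in for real sampling) */
    size_t     pos;                   /* next row to sample */
    double     col_disksize[TOY_NCOLS]; /* extra info: bytes seen per column */
    bool       active;                /* true between beginscan and endscan */
} AcquireSampleContext;

/* Hypothetical callback table; the signatures are assumptions. */
typedef struct ToyAnalyzeRoutine
{
    void (*scan_analyze_beginscan)(AcquireSampleContext *ctx);
    bool (*scan_analyze_sample_tuple)(AcquireSampleContext *ctx, int *values);
    void (*scan_analyze_endscan)(AcquireSampleContext *ctx);
} ToyAnalyzeRoutine;

static void
toy_beginscan(AcquireSampleContext *ctx)
{
    ctx->pos = 0;
    for (size_t c = 0; c < TOY_NCOLS; c++)
        ctx->col_disksize[c] = 0.0;
    ctx->active = true;
}

/* The AM, not the caller, decides which row comes next. */
static bool
toy_sample_tuple(AcquireSampleContext *ctx, int *values)
{
    assert(ctx->active);
    if (ctx->pos >= ctx->nrows)
        return false;                 /* nothing left to sample */
    for (size_t c = 0; c < TOY_NCOLS; c++)
    {
        values[c] = ctx->rows[ctx->pos * TOY_NCOLS + c];
        ctx->col_disksize[c] += sizeof(int);   /* record extra per-column info */
    }
    ctx->pos += ctx->step;
    return true;
}

static void
toy_endscan(AcquireSampleContext *ctx)
{
    ctx->active = false;
}

static const ToyAnalyzeRoutine toy_am = {
    toy_beginscan, toy_sample_tuple, toy_endscan
};

/* Caller side: collects up to targrows rows without knowing about blocks. */
size_t
acquire_sample_rows_sketch(const ToyAnalyzeRoutine *am, AcquireSampleContext *ctx,
                           int *sample, size_t targrows)
{
    size_t n = 0;

    am->scan_analyze_beginscan(ctx);
    while (n < targrows &&
           am->scan_analyze_sample_tuple(ctx, &sample[n * TOY_NCOLS]))
        n++;
    am->scan_analyze_endscan(ctx);
    return n;
}
```

The point of the sketch is only that the caller never computes block numbers itself, so an AM
is free to use logical block ids or no blocks at all, and can hand back extra per-column info
through the context.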

For 3), we would like a mechanism similar to RelOptInfo->rows, which is calculated as
(RelOptInfo->tuples * Selectivity): RelOptInfo->pages could be calculated with a page selectivity
that is based on the selected zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
shows one idea: add `stadiskfrac` to pg_statistic and have the planner use it to estimate
RelOptInfo->pages.
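
As a rough illustration of that arithmetic, the sketch below scales a table's total page count
by the summed disk fraction of the selected columns. The function name, its parameters, and the
assumption that the per-column fractions are simply added are ours for illustration; the patch
may combine `stadiskfrac` values differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: estimate how many pages a scan touches when it reads
 * only some columns of a column-oriented table.
 *
 * diskfrac[i] is assumed to be the fraction of the table's on-disk size
 * occupied by column i (the role the proposed stadiskfrac would play).
 * selected[] lists the indexes of the columns the query needs.
 */
double
estimate_scanned_pages(double total_pages, const double *diskfrac,
                       const size_t *selected, size_t nselected)
{
    double frac = 0.0;

    assert(total_pages >= 0.0);
    for (size_t i = 0; i < nselected; i++)
        frac += diskfrac[selected[i]];

    /* Guard against fractions that sum to slightly more than 1. */
    if (frac > 1.0)
        frac = 1.0;

    return total_pages * frac;
}
```

For example, with fractions {0.5, 0.25, 0.125, 0.125} and a 1000-page table, a query that needs
only the second and third columns would be costed at 375 pages instead of 1000.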

0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how Zedstore uses the
previous patches to achieve the following:
1. Use logical block ids to acquire the sample rows.
2. Acquire sample rows only from a specified column c1; this is used when the user analyzes
    only specified columns of a table, e.g. "analyze zs (c1)".
3. During ANALYZE, the zedstore table AM provides extra disk size info; ANALYZE then computes
    the physical fraction statistic of each column, and the planner uses it to estimate the IO
    cost based on the selected columns.
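
One plausible way to derive the per-column physical fraction in item 3 is to divide each
column's sampled disk size by the total over all columns. This is only a sketch under that
assumption; compute_disk_fractions and its parameters are hypothetical names, not taken from
the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: turn per-column byte counts gathered during sampling
 * into fractions of the total. If nothing was sampled, all fractions are 0.
 */
void
compute_disk_fractions(const double *col_bytes, size_t ncols, double *frac_out)
{
    double total = 0.0;

    for (size_t i = 0; i < ncols; i++)
    {
        assert(col_bytes[i] >= 0.0);
        total += col_bytes[i];
    }

    for (size_t i = 0; i < ncols; i++)
        frac_out[i] = (total > 0.0) ? col_bytes[i] / total : 0.0;
}
```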

Thanks,
Pengzhou