Thread: PoC Refactor AM analyse API

PoC Refactor AM analyse API

From
Denis Smirnov
Date:
Hello all!

I suggest refactoring the analyze AM API, as it is too heap-specific at the moment. The problem was inspired by Greenplum’s analyze improvements for the append-optimized row and column AMs with variable-size compressed blocks.
Currently we do analyze in two steps:

1. Sample fixed-size blocks with Knuth's algorithm S (the BlockSampler functions).
2. Collect tuples into a reservoir with Vitter's algorithm Z.
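The two-stage scheme can be sketched roughly as follows. This is a toy illustration, not the actual BlockSampler code, and it uses the simpler algorithm R where Vitter's Z only adds a skip-ahead optimization on top:

```c
#include <stdlib.h>

/* Stage 1: Knuth's algorithm S -- choose n of N block numbers,
 * emitted in ascending order.  Returns the number chosen (== n
 * whenever n <= N). */
static int sample_blocks(int N, int n, int *out)
{
    int chosen = 0;

    for (int t = 0; t < N && chosen < n; t++)
    {
        /* select block t with probability (n - chosen) / (N - t) */
        double u = (double) rand() / ((double) RAND_MAX + 1);

        if (u * (N - t) < n - chosen)
            out[chosen++] = t;
    }
    return chosen;
}

/* Stage 2: reservoir sampling (algorithm R) -- keep a uniform sample
 * of k rows from a stream of unknown length.  Returns the updated
 * count of rows seen so far. */
static int reservoir_add(int *reservoir, int k, int seen, int row)
{
    if (seen < k)
        reservoir[seen] = row;          /* reservoir not full yet */
    else
    {
        int j = rand() % (seen + 1);    /* keep row with probability k/(seen+1) */

        if (j < k)
            reservoir[j] = row;
    }
    return seen + 1;
}
```

Algorithm S guarantees exactly n blocks whenever n <= N, and the reservoir step keeps every row with equal probability once the stream ends.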

So this doesn’t work, for example, for AMs using variable-sized physical blocks. They need weighted random sampling (WRS) algorithms like A-Chao, or logical blocks to feed Knuth's algorithm S (and they have a problem with RelationGetNumberOfBlocks() estimating the physical number of blocks). Another problem is with columns: they are not passed to the analyze begin-scan and can’t benefit from column storage at ANALYZE TABLE (col).
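For reference, the A-Chao weighted reservoir step mentioned above looks roughly like this. It is a simplified sketch (it assumes no single weight exceeds wsum / k, so inclusion probabilities never top 1), with a variable-size block's weight being, say, its tuple count or byte size:

```c
#include <stdlib.h>

/* A-Chao weighted reservoir sampling (simplified): each item ends up
 * in the k-slot reservoir with probability proportional to its
 * weight.  Returns the updated running weight sum. */
static double chao_add(int *reservoir, int k, int *nfilled,
                       double wsum, int item, double weight)
{
    wsum += weight;

    if (*nfilled < k)
        reservoir[(*nfilled)++] = item;     /* still filling the reservoir */
    else
    {
        double p = k * weight / wsum;       /* inclusion probability */
        double u = (double) rand() / ((double) RAND_MAX + 1);

        if (u < p)
            reservoir[rand() % k] = item;   /* evict a random slot */
    }
    return wsum;
}
```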

The suggestion is to replace table_scan_analyze_next_block() and table_scan_analyze_next_tuple() with a single function: table_acquire_sample_rows(). The AM implementation of table_acquire_sample_rows() can use the BlockSampler functions if it wants to, but if the AM is not block-oriented, it can do something else. The patch also passes VacAttrStats to table_acquire_sample_rows() for column-oriented AMs, and removes the PROGRESS_ANALYZE_BLOCKS_TOTAL and PROGRESS_ANALYZE_BLOCKS_DONE definitions, since not all AMs are block-oriented.
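As a rough sketch of the proposed callback's shape (the exact signature here is my assumption, not the patch's; only the function and parameter names come from the discussion, and the stub types stand in for the real server types):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the server's Relation and HeapTuple
 * types -- only the shape of the callback matters here. */
typedef struct RelationStub RelationStub;
typedef struct HeapTupleStub HeapTupleStub;

/* One callback replacing table_scan_analyze_next_block() and
 * table_scan_analyze_next_tuple(): the AM fills rows[] with up to
 * targrows sampled tuples however it likes (block sampler, WRS over
 * variable-size blocks, ...) and reports live/dead row estimates. */
typedef int (*acquire_sample_rows_fn) (RelationStub *onerel,
                                       int elevel,
                                       HeapTupleStub **rows,
                                       int targrows,
                                       double *totalrows,
                                       double *totaldeadrows);

/* A toy AM with no rows at all, just to show the contract. */
static int empty_am_acquire(RelationStub *onerel, int elevel,
                            HeapTupleStub **rows, int targrows,
                            double *totalrows, double *totaldeadrows)
{
    (void) onerel; (void) elevel; (void) rows; (void) targrows;
    *totalrows = 0.0;
    *totaldeadrows = 0.0;
    return 0;                   /* number of rows placed in rows[] */
}
```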




Best regards,
Denis Smirnov | Developer
sd@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia

Attachment

Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
It seems that my mail client set the wrong MIME type for the attached patch, and it was filtered by the web archive. So I attach the patch again for the web archive.



> On 7 Dec 2020, at 23:23, Denis Smirnov <sd@arenadata.io> wrote:
>
> <am-analyze-1.patch>


Attachment

Re: PoC Refactor AM analyse API

From
Andrey Borodin
Date:
Hi Denis!

> On 7 Dec 2020, at 18:23, Denis Smirnov <sd@arenadata.io> wrote:

Just a few random notes about the idea.
Heap pages are not of a fixed size when measured in tuple count, and a comment in the code describes this:
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.

The current implementation provides a framework with shared code, though this framework is only suitable for block-of-tuples AMs, and it has statistical downsides when the tuple count varies too much.
Maybe we can just provide a richer API, to have both: avoid copying code and provide flexibility.

Best regards, Andrey Borodin.


Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
Andrey, thanks for your feedback!

I agree that AMs with fixed-size blocks can have much-alike code in acquire_sample_rows() (though it is not a rule). But there are several points about the current master sampling.

* It is not perfect - AM developers may want to improve it with other sampling algorithms.
* It is designed with a big influence of the heap AM - for example, RelationGetNumberOfBlocks() returns uint32 while other AMs can have a larger number of blocks.
* heapam_acquire_sample_rows() is a small function - I don't think it is a big trouble for any AM developer to write something similar.
* Some AMs may have single-level sampling (only Vitter's algorithm Z, for example) - why not?

As a result we get a single and clear method to acquire rows for statistics. If we don’t modify but rather extend the current API (for example, in the manner it is done for the FDW API), the code becomes more complicated and difficult to understand.

> On 8 Dec 2020, at 18:42, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Maybe can we just provide a richer API? To have both: avoid copying code and provide flexibility.




Re: PoC Refactor AM analyse API

From
Andrey Borodin
Date:

> On 8 Dec 2020, at 16:44, Denis Smirnov <sd@arenadata.io> wrote:
>
> As a result we get a single and clear method to acquire rows for statistics. If we don’t modify but rather extend current API (for example in a manner it is done for FDW) the code becomes more complicated and difficult to understand.

This makes sense. The purpose of the API is to provide a flexible abstraction. The current table_scan_analyze_next_block()/table_scan_analyze_next_tuple() API assumes too much about the AM implementation.
But why do you pass int natts and VacAttrStats **stats to acquire_sample_rows()? Is it of any use? It seems to break the abstraction too.

Best regards, Andrey Borodin.


Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
> But why do you pass int natts and VacAttrStats **stats to acquire_sample_rows()? Is it of any use? It seems to break the abstraction too.

Yes, it is really a kluge that should be discussed. The main problem is that we don’t pass projection information to the analyze scan (the analyze begin-scan relies only on relation information during initialization). As a result we can’t benefit from column AMs during «analyze t(col)» and consume data only from the target columns. These parameters were added to solve this problem.

Best regards,
Denis Smirnov | Developer
sd@arenadata.io
Arenadata | Godovikova 9-17, Moscow 129085 Russia




Re: PoC Refactor AM analyse API

From
Heikki Linnakangas
Date:
On 30/12/2020 11:12, Denis Smirnov wrote:
> 
>> But why do you pass int natts and VacAttrStats **stats to
>> acquire_sample_rows()? Is it of any use? It seems to break
>> abstraction too.
> 
> Yes, it is really a kluge that should be discussed. The main problem
> is that we don’t pass projection information to analyze scan (analyze
> begin scan relise only on relation information during
> initialization). And as a result we can’t benefit from column AMs
> during «analyze t(col)» and consume data only from target columns.
> This parameters were added to solve this problem.

The documentation needs to be updated accordingly, see 
AcquireSampleRowsFunc in fdwhandler.sgml.

This part of the patch, adding the list of columns being analyzed, seems 
a bit unfinished. I'd suggest leaving that out for now, and adding it as 
part of the "Table AM modifications to accept column projection lists" 
patch that's also in this commitfest [1].

> This suggestion also ... removes PROGRESS_ANALYZE_BLOCKS_TOTAL and
> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be
> block-oriented.

We shouldn't just remove it; a progress indicator is nice. It's true 
that not all AMs are block-oriented, but those that are can still use 
those. Perhaps we can add other PROGRESS_ANALYZE_* states for 
non-block-oriented AMs, but that can wait until there is a concrete use 
for it.

> static int
> acquire_sample_rows(Relation onerel, int elevel,
>                     HeapTuple *rows, int targrows,
>                     double *totalrows, double *totaldeadrows)
> {
>     int            numrows = 0;    /* # rows now in reservoir */
>     TableScanDesc scan;
> 
>     Assert(targrows > 0);
> 
>     scan = table_beginscan_analyze(onerel);
> 
>     numrows = table_acquire_sample_rows(scan, elevel,
>                                         natts, stats,
>                                         vac_strategy, rows,
>                                         targrows, totalrows,
>                                         totaldeadrows);
> 
>     table_endscan(scan);
> 
>     /*
>      * If we didn't find as many tuples as we wanted then we're done. No sort
>      * is needed, since they're already in order.
>      *
>      * Otherwise we need to sort the collected tuples by position
>      * (itempointer). It's not worth worrying about corner cases where the
>      * tuples are already sorted.
>      */
>     if (numrows == targrows)
>         qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
> 
>     return numrows;
> }

Perhaps better to move the qsort() into heapam_acquire_sample_rows(), 
and document that the acquire_sample_rows() AM function must return the 
tuples in 'ctid' order. In a generic API, it seems like a shady 
assumption that they must be in order if we didn't find as many rows as 
we wanted. Or always call qsort(); if the tuples are already in order, 
that should be pretty quick.
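The "sort by physical position" contract amounts to a comparator like the following. RowPos is a simplified stand-in for the real ItemPointerData, not the patch's actual compare_rows():

```c
#include <stdlib.h>

/* Simplified stand-in for ItemPointerData: a row's physical
 * position as (block number, offset within block). */
typedef struct
{
    unsigned int blkno;
    unsigned int offset;
} RowPos;

/* qsort() comparator putting sampled rows into physical (ctid) order. */
static int compare_row_pos(const void *a, const void *b)
{
    const RowPos *ra = (const RowPos *) a;
    const RowPos *rb = (const RowPos *) b;

    if (ra->blkno != rb->blkno)
        return (ra->blkno < rb->blkno) ? -1 : 1;
    if (ra->offset != rb->offset)
        return (ra->offset < rb->offset) ? -1 : 1;
    return 0;
}
```

Calling qsort() unconditionally with such a comparator is cheap when the rows are already nearly in order, which is the trade-off suggested above.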

The table_beginscan_analyze() call seems a bit pointless now. Let's 
remove it, and pass the Relation to table_acquire_sample_rows directly.

>     /*
>      * This callback needs to fill reservour with sample rows during analyze
>      * scan.
>      */
>     int            (*acquire_sample_rows) (TableScanDesc scan,

The "reservoir" is related to the block sampler, but that's just an 
implementation detail of the function. Perhaps something like "Acquire a 
sample of rows from the table, for ANALYZE". And explain the arguments 
here, or in table_acquire_sample_rows().

Overall, I like where this patch is going.

[1] https://commitfest.postgresql.org/31/2922/

- Heikki



Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
Thanks for your review, Heikki.

I have made the changes you have requested.

1. All modifications interconnected with column projection were reverted (they should be added in https://commitfest.postgresql.org/31/2922 if the current patch is merged one day).
2. I have returned the PROGRESS_ANALYZE_* states.
3. qsort() was moved into heapam_acquire_sample_rows(). A comment was also added that the acquire_sample_rows() AM function must return the tuples in physical table order.
4. table_beginscan_analyze() was removed as a redundant function.
5. The acquire_sample_rows() comment about the reservoir was changed.


Best regards,
Denis Smirnov | Developer
sd@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia


Attachment

Re: PoC Refactor AM analyse API

From
Zhihong Yu
Date:
Hi,

+       *totalrows = floor((liverows / bs.m) * totalblocks + 0.5);

Is the above equivalent to:

+       *totalrows = ceil((liverows / bs.m) * totalblocks);

For compare_rows(), it seems the computation of oa and ob can be delayed to when ba == bb (after the first two if statements).

Cheers


Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
Hello,  Zhihong.

Thanks for your comments.

1. I am afraid the equivalence of "floor(val + 0.5)" to "ceil(val)" does not hold: "floor(2.1 + 0.5)" returns 2 while "ceil(2.1)" returns 3. We can’t use the replacement you have suggested.

2. >> For compare_rows(), it seems the computation of oa and ob can be delayed to when ba == bb (after the first two if statements).
I have checked some examples of the ASM code generated by different compilers with -O2/-O3 flags on https://gcc.godbolt.org and didn’t see any big benefit in the resulting CPU instructions. You can check the attached example yourself.



Best regards,
Denis Smirnov | Developer
sd@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia


Attachment

Re: PoC Refactor AM analyse API

From
Zhihong Yu
Date:
Denis:
Thanks for considering my suggestion.

For #1, I didn't take your example into account. Thanks for pointing that out.

Cheers


Re: PoC Refactor AM analyse API

From
Jaime Casanova
Date:
On Fri, Feb 19, 2021 at 12:06:12PM +1000, Denis Smirnov wrote:
> I have made the changes you have requested.

Hi Denis,

This doesn't apply anymore because of c6fc50c, can you resubmit a new
patch?

Please note that the patch must be submitted with a .patch extension
instead of .txt, that way the CI at http://commitfest.cputube.org/ will
be able to execute automatic tests on it.

Regards,

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL



Re: PoC Refactor AM analyse API

From
Michael Paquier
Date:
On Wed, Sep 08, 2021 at 10:06:25AM -0500, Jaime Casanova wrote:
> This doesn't apply anymore because of c6fc50c, can you resubmit a new
> patch?

Activity has stalled here, so I have marked the entry as RwF in the CF
app.
--
Michael

Attachment

Re: PoC Refactor AM analyse API

From
Denis Smirnov
Date:
I think this patch should be totally redesigned and removed from the upcoming CF. The problem is that vanilla PG has a single storage manager implementation, which works with fixed-size blocks. The current patch didn’t take this aspect into account. We should first decide whether PG needs the ability to implement custom storage managers with variable-size blocks and custom block buffers (or without any, for OLAP), and only after that should we move on to variable-size block analyze.

Best regards,
Denis Smirnov | Developer
sd@arenadata.io 
Arenadata | Bldg. 3, Block 1 Skladochnaya St.  Moscow, 127018