Home > mailing lists

Re: PoC Refactor AM analyse API - Mailing list pgsql-hackers

From	Andrey Borodin
Subject	Re: PoC Refactor AM analyse API
Date	December 8, 2020 08:42:12
Msg-id	95D509FA-45AC-48BE-82BF-8377B9A4F687@yandex-team.ru Whole thread Raw
In response to	PoC Refactor AM analyse API (Смирнов Денис <sd@arenadata.io>)
Responses	Re: PoC Refactor AM analyse API
List	pgsql-hackers

Tree view

Hi Denis!

> 7 дек. 2020 г., в 18:23, Смирнов Денис <sd@arenadata.io> написал(а):
>
> I suggest a refactoring of analyze AM API as it is too much heap specific at the moment. The problem was inspired by
Greenplum’sanalyze improvement for append-optimized row and column AM with variable size compressed blocks. 
> Currently we do analyze in two steps.
>
> 1. Sample fix size blocks with algorithm S from Knuth (BlockSampler function)
> 2. Collect tuples into reservoir with algorithm Z from Vitter.
>
> So this doesn’t work for AMs using variable sized physical blocks for example. They need weight random sampling (WRS)
algorithmslike A-Chao or logical blocks to follow S-Knuth (and have a problem with RelationGetNumberOfBlocks()
estimatinga physical number of blocks). Another problem with columns - they are not passed to analyze begin scan and
can’tbenefit from column storage at ANALYZE TABLE (COL). 
>
> The suggestion is to replace table_scan_analyze_next_block() and table_scan_analyze_next_tuple() with a single
function:table_acquire_sample_rows(). The AM implementation of table_acquire_sample_rows() can use the BlockSampler
functionsif it wants to, but if the AM is not block-oriented, it could do something else. This suggestion also passes
VacAttrStatsto table_acquire_sample_rows() for column-oriented AMs and removes PROGRESS_ANALYZE_BLOCKS_TOTAL and
PROGRESS_ANALYZE_BLOCKS_DONEdefinitions as not all AMs can be block-oriented. 

Just few random notes about the idea.
Heap pages are not of a fixed size, when measured in tuple count. And comment in the codes describes it.
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected.  For large relations
 * the number of different blocks represented by the sample tends to be
 * too small.  We can live with that for now.  Improvements are welcome.

Current implementation provide framework with shared code. Though this framework is only suitable for block-of-tuples
AMs.And have statistical downsides when count of tuples varies too much. 
Maybe can we just provide a richer API? To have both: avoid copying code and provide flexibility.

Best regards, Andrey Borodin.

pgsql-hackers by date:

From: Ajin Cherian
Date: 08 December 2020, 08:31:15
Subject: Re: [HACKERS] logical decoding of two-phase transactions

From: Krunal Bauskar
Date: 08 December 2020, 09:03:59
Subject: Re: Improving spin-lock implementation on ARM.

Re: PoC Refactor AM analyse API - Mailing list pgsql-hackers

Previous

Next