Re: PoC Refactor AM analyse API - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: PoC Refactor AM analyse API
Date
Msg-id 45641a84-11c8-5173-2f12-4efa48b18811@iki.fi
Whole thread Raw
In response to Re: PoC Refactor AM analyse API  (Denis Smirnov <sd@arenadata.io>)
Responses Re: PoC Refactor AM analyse API  (Denis Smirnov <sd@arenadata.io>)
List pgsql-hackers
On 30/12/2020 11:12, Denis Smirnov wrote:
> 
>> But why do you pass int natts and VacAttrStats **stats to
>> acquire_sample_rows()? Is it of any use? It seems to break
>> abstraction too.
> 
> Yes, it is really a kluge that should be discussed. The main problem
> is that we don’t pass projection information to analyze scan (analyze
> begin scan relise only on relation information during
> initialization). And as a result we can’t benefit from column AMs
> during «analyze t(col)» and consume data only from target columns.
> This parameters were added to solve this problem.

The documentation needs to be updated accordingly, see 
AcquireSampleRowsFunc in fdwhandler.sgml.

This part of the patch, adding the list of columns being analyzed, seems 
a bit unfinished. I'd suggest to leave that out for now, and add it as 
part of the "Table AM modifications to accept column projection lists" 
patch that's also in this commitfest [1]

> This suggestion also ... removes PROGRESS_ANALYZE_BLOCKS_TOTAL and
> PROGRESS_ANALYZE_BLOCKS_DONE definitions as not all AMs can be
> block-oriented.

We shouldn't just remove it, a progress indicator is nice. It's true 
that not all AMs are block-oriented, but those that are can still use 
those. Perhaps we can add ther PROGRESS_ANALYZE_* states for 
non-block-oriented AMs, but that can wait until there is a concrete use 
for it.

> static int
> acquire_sample_rows(Relation onerel, int elevel,
>                     HeapTuple *rows, int targrows,
>                     double *totalrows, double *totaldeadrows)
> {
>     int            numrows = 0;    /* # rows now in reservoir */
>     TableScanDesc scan;
> 
>     Assert(targrows > 0);
> 
>     scan = table_beginscan_analyze(onerel);
> 
>     numrows = table_acquire_sample_rows(scan, elevel,
>                                         natts, stats,
>                                         vac_strategy, rows,
>                                         targrows, totalrows,
>                                         totaldeadrows);
> 
>     table_endscan(scan);
> 
>     /*
>      * If we didn't find as many tuples as we wanted then we're done. No sort
>      * is needed, since they're already in order.
>      *
>      * Otherwise we need to sort the collected tuples by position
>      * (itempointer). It's not worth worrying about corner cases where the
>      * tuples are already sorted.
>      */
>     if (numrows == targrows)
>         qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
> 
>     return numrows;
> }

Perhaps better to move the qsort() into heapam_acquire_sample_rows(), 
and document that the acquire_sample_rows() AM function must return the 
tuples in 'ctid' order. In a generic API, it seems like a shady 
assumption that they must be in order if we didn't find as many rows as 
we wanted. Or always call qsort(); if the tuples are already in order, 
that should be pretty quick.

The table_beginscan_analyze() call seems a bit pointless now. Let's 
remove it, and pass the Relation to table_acquire_sample_rows directly.

>     /*
>      * This callback needs to fill reservour with sample rows during analyze
>      * scan.
>      */
>     int            (*acquire_sample_rows) (TableScanDesc scan,

The "reservoir" is related to the block sampler, but that's just an 
implementation detail of the function. Perhaps something like "Acquire a 
sample of rows from the table, for ANALYZE". And explain the arguments 
here, or in table_acquire_sample_rows().

Overall, I like where this patch is going.

[1] https://commitfest.postgresql.org/31/2922/

- Heikki



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: PoC/WIP: Extended statistics on expressions
Next
From: Fujii Masao
Date:
Subject: Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit