Re: Import Statistics in postgres_fdw before resorting to sampling. - Mailing list pgsql-hackers

From Etsuro Fujita
Subject Re: Import Statistics in postgres_fdw before resorting to sampling.
Date
Msg-id CAPmGK15Lk0ynP_b46sgfV1pmByzWUKVQoZBSV8fbN0rKb_6AnA@mail.gmail.com
Whole thread Raw
In response to Re: Import Statistics in postgres_fdw before resorting to sampling.  (Corey Huinker <corey.huinker@gmail.com>)
List pgsql-hackers
Hi Corey,

This is an important feature.  Thanks for working on it.

On Mon, Nov 3, 2025 at 2:26 PM Corey Huinker <corey.huinker@gmail.com> wrote:

>> My point is: rather than trying to implement a second solution with a
>> new callback, shouldn't we try to rework the existing callback so as
>> it would fit better with what you are trying to do here: feed data
>> that ANALYZE would then be in charge of inserting?

> To do that, we would have to somehow generate fake data locally from the pg_stats data that we did pull over the
wire,just to have ANALYZE compute it back down to the data we already had. 

>>   Relying on
>> pg_restore_relation_stats() for the job feels incorrect to me knowing
>> that ANALYZE is the code path in charge of updating the stats of a
>> relation.

> That sounds like an argument for expanding ANALYZE to have syntax that will digest pre-computed rows, essentially
takingover what pg_restore_relation_stats and pg_restore_attribute_stats already do, and that idea was dismissed fairly
earlyon in development, though I wasn't against it at the time. 
>
> As it is, those two functions were developed to match what ANALYZE does. pg_restore_relation_stats even briefly had
inplaceupdates because that's what ANALYZE did. 

To me it seems like a good idea to rely on pg_restore_relation_stats
and pg_restore_attribute_stats, rather than doing some rework on
analyze.c; IMO I don't think it's a good idea to do such a thing for
something rather special like this.

>> The new callback is a shortcut that bypasses what ANALYZE
>> does, so the problem, at least it seems to me, is that we want to
>> retrieve all the data in a single step, like your new callback, not in
>> two steps, something that only the existing callback allows.  Hence,
>> wouldn't it be more natural to retrieve the total number of pages and
>> reltuples from one callback, meaning that for local relations we
>> should delay RelationGetNumberOfBlocks() inside the existing
>> "acquire_sample_rows" callback (renaming it would make sense)?  This
>> requires some redesign of AnalyzeForeignTable and the internals of
>> analyze.c, but it would let FDW extensions know about the potential
>> efficiency gain.

> I wanted to make minimal disruption to the existing callbacks, but that may have been misguided.

+1 for the minimal disruption.

Other initial comments:

The commit message says:

    This is managed via two new options, fetch_stats and remote_analyze,
    both are available at the server level and table level. If fetch_stats
    is true, then the ANALYZE command will first attempt to fetch statistics
    from the remote table and import those statistics locally.

    If remote_analyze is true, and if the first attempt to fetch remote
    statistics found no attribute statistics, then an attempt will be made
    to ANALYZE the remote table before a second and final attempt to fetch
    remote statistics.

    If no statistics are found, then ANALYZE will fall back to the normal
    behavior of sampling and local analysis.

I think the first step assumes that the remote stats are up-to-date;
if they aren't, it would cause a regression.  (If the remote relation
is a plain table, they are likely to be up-to-date, but for example,
if it is a foreign table, it's possible that they are stale.)  So how
about making it the user's responsibility to make them up-to-date?  If
doing so, we wouldn't need to do the second and third steps anymore,
making the patch simple.

On the other hand:

    This operation will only work on remote relations that can have stored
    statistics: tables, partitioned tables, and materialized views. If the
    remote relation is a view then remote fetching/analyzing is just wasted
    effort and the user is better of setting fetch_stats to false for that
    table.

I'm not sure the waste effort is acceptable; IMO, if the remote table
is a view, I think that the system should detect that in some way, and
then just do the normal ANALYZE processing.

That's it for now.

My apologies for the delayed response.

Best regards,
Etsuro Fujita



pgsql-hackers by date:

Previous
From: Jacob Champion
Date:
Subject: Re: RFC 9266: Channel Bindings for TLS 1.3 support
Next
From: Tomas Vondra
Date:
Subject: Re: Adding basic NUMA awareness