Re: Import Statistics in postgres_fdw before resorting to sampling. - Mailing list pgsql-hackers

From: Etsuro Fujita
Subject: Re: Import Statistics in postgres_fdw before resorting to sampling.
Msg-id: CAPmGK16AQzGsEe6pD7m42AO1KRCuXQu=U_abXaAf3juCS5U+mA@mail.gmail.com
In response to: Re: Import Statistics in postgres_fdw before resorting to sampling.  (Corey Huinker <corey.huinker@gmail.com>)
List: pgsql-hackers
On Sun, Jan 4, 2026 at 11:56 AM Corey Huinker <corey.huinker@gmail.com> wrote:

>> This version of the patch wouldn't fall back to the normal ANALYZE
>> processing anymore, so this documentation should be updated as such.
>> Also, as I think it's the user's responsibility to ensure the existing
>> statistics are up-to-date, as I said before, I think we should add a
>> note about that here.  Also, as some users wouldn't be able to ensure
>> it, I'm wondering if the default should be false.

> I agree, if there is no fallback, then the default should be false.
> When I was initially brainstorming this patch, Nathan Bossart had
> suggested making it the default because 1) that would be an automatic
> benefit to users and 2) the cost for attempting to import stats was
> small in comparison to a table sample, so it was worth the attempt. I
> still want users to get that automatic benefit, but if there is no
> fallback to sampling then the default only makes sense as false.

I think the FDW API I proposed could actually allow us to fall back
to sampling: StatisticsAreImportable could be modified so that it
also checks 1) that there are statistics on the remote server and 2)
that those statistics are fresh enough, returning true if both hold
and false otherwise; in the false case we would fall back to
sampling.  And if we modified it as such, I think we could change the
default to true.  (Checking #2 is necessary to avoid importing stale
statistics, which would degrade plan quality.)
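
For the freshness check in #2, one possibility would be to look at
the last-analyze times on the remote side.  A rough, untested sketch
of such a query, using the pg_stat_all_tables view (the staleness
comparison itself would live in the caller):

    SELECT GREATEST(s.last_analyze, s.last_autoanalyze) AS last_stats_time
    FROM pg_catalog.pg_stat_all_tables AS s
    WHERE s.schemaname = $1 AND s.relname = $2;

If last_stats_time is NULL or older than some threshold,
StatisticsAreImportable would return false and we would sample
instead.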

>> If the user wasn't able to ensure the statistics are up-to-date, I
>> think he/she might want to do remote ANALYZE *before* fetching the
>> statistics, not after trying to do so.  So I think we could instead
>> provide this option as such.  What do you think about that?  Anyway,
>> I'd vote for leaving this for another patch, as I think it's a
>> nice-to-have option rather than a must-have one, as I said before.

> I hadn't thought of that one. I think it has some merit, but I think
> we'd want the try-after case as well. So the settings would be: "off"
> (never remote analyze, default), "on" (always analyze before
> fetching), and "retry" (analyze if no stats were found). This feature
> feeds into the same thinking that the default setting did, which was
> to make this feature just do what is usually the smart thing, and do
> it automatically. Building it in pieces might be easier to get
> committed, but it takes away the dream of seamless automatic
> improvement, and establishes defaults that can't be changed in future
> versions. That dream was probably unrealistic, but I wanted to try.

Remote ANALYZE would be an interesting idea, but to get the automatic
improvement, I think we should first work on the issue I mentioned
above.  So I still think we should leave this for future work.
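
(If we do pursue it later, I imagine the knob would be a server or
foreign-table option; the option name "remote_analyze" and its values
below are hypothetical, just mirroring the three settings you listed:

    ALTER SERVER loopback OPTIONS (ADD remote_analyze 'retry');

where "loopback" stands in for the foreign server name.)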

From a different perspective, even without automatic improvements
such as remote ANALYZE, I think this feature is useful for many
users.

>> +static const char *attstats_query_17 =
>> +   "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
>> +   "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
>> +   "s.histogram_bounds, s.correlation, s.most_common_elems, "
>> +   "s.most_common_elem_freqs, s.elem_count_histogram, "
>> +   "s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram "
>> +   "FROM pg_catalog.pg_stats AS s "
>> +   "WHERE s.schemaname = $1 AND s.tablename = $2 "
>> +   "ORDER BY s.attname, s.inherited DESC";
>>
>> I think we should retrieve the attribute statistics for only the
>> referenced columns of the remote table, not all the columns of it, to
>> reduce the data transfer and the cost of matching local/remote
>> attributes in import_fetched_statistics().

> I thought about this, and decided that we wanted to 1) avoid
> per-column round trips and 2) keep the remote queries simple. It's
> not such a big deal to add "AND s.attname = ANY($3)" and construct a
> '{att1,att2,"ATt3"}' string, as we already do in pg_dump in a few
> places.

Yeah, I was also thinking of modifying the query as you proposed.
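
Roughly, the query would then become something like the following
(untested; it is the patch's attstats_query_17 with just the WHERE
clause extended, and with $3 being an array literal such as
'{att1,att2,"ATt3"}' built from the referenced columns):

    SELECT DISTINCT ON (s.attname) s.attname, s.null_frac, s.avg_width,
        s.n_distinct, s.most_common_vals, s.most_common_freqs,
        s.histogram_bounds, s.correlation, s.most_common_elems,
        s.most_common_elem_freqs, s.elem_count_histogram,
        s.range_length_histogram, s.range_empty_frac,
        s.range_bounds_histogram
    FROM pg_catalog.pg_stats AS s
    WHERE s.schemaname = $1 AND s.tablename = $2
        AND s.attname = ANY($3)
    ORDER BY s.attname, s.inherited DESC;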

>> In fetch_remote_statistics()
>>
>> +   if (server_version_num >= 180000)
>> +       relation_sql = relstats_query_18;
>> +   else if (server_version_num >= 140000)
>> +       relation_sql = relstats_query_14;
>> +   else
>> +       relation_sql = relstats_query_default;
>>
>> I think that having the definition for each of relstats_query_18,
>> relstats_query_14 and relstats_query_default makes the code redundant
>> and the maintenance hard.  To avoid that, how about building the
>> relation_sql query dynamically as done for the query to fetch all
>> table data from the remote server in postgresImportForeignSchema().
>> Same for the attribute_sql query.

> It may be a moot point. If we're not fetching relallfrozen, then the
> 14 & 18 cases are now the same, and since the pre-14 case concerns
> differentiating between analyzed and unanalyzed tables, we would just
> map that to 0 IF we kept those stats, but we almost never would
> because an unanalyzed remote table would not have the attribute stats
> necessary to qualify as a good remote fetch. So we're down to just
> one static query.

You are right.  As the relation_sql query is only used in
fetch_remote_statistics(), shouldn't the query be defined within that
function?
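
That is, move the declaration into the function, e.g. (a sketch; the
column list here is an assumption, since the final relstats query
isn't shown above):

    /* Defined locally, as fetch_remote_statistics() is the only user. */
    static const char *const relation_sql =
        "SELECT c.relpages, c.reltuples, c.relallvisible "
        "FROM pg_catalog.pg_class AS c "
        "JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace "
        "WHERE n.nspname = $1 AND c.relname = $2";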

Best regards,
Etsuro Fujita


