Re: Import Statistics in postgres_fdw before resorting to sampling. - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Import Statistics in postgres_fdw before resorting to sampling.
Date
Msg-id CADkLM=cVZH+SDUuvsyMHzuWTK5XkK1i_L5dgTNrgRXWN1S-syQ@mail.gmail.com
In response to Re: Import Statistics in postgres_fdw before resorting to sampling.  (Etsuro Fujita <etsuro.fujita@gmail.com>)
List pgsql-hackers

This version of the patch wouldn't fall back to the normal ANALYZE
processing anymore, so this documentation should be updated accordingly.
Also, as I think it's the user's responsibility to ensure the existing
statistics are up-to-date, as I said before, I think we should add a
note about that here.  Also, as some users wouldn't be able to ensure
it, I'm wondering if the default should be false.

I agree: if there is no fallback, then the default should be false. When I was initially brainstorming this patch, Nathan Bossart had suggested defaulting it to true because 1) that would be an automatic benefit to users and 2) the cost of attempting to import stats was small compared to a table sample, so it was worth the attempt. I still want users to get that automatic benefit, but with no fallback to sampling the default only makes sense as false.
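Purely for illustration, with a false default the user would opt in explicitly. The option name "import_remote_stats" below is an assumption for the sketch, not necessarily what the patch calls it:

    -- hypothetical option name; the patch may spell this differently
    ALTER SERVER remote_pg
        OPTIONS (ADD import_remote_stats 'true');

    -- with the option on, ANALYZE of the foreign table would take the
    -- remote-fetch path rather than sampling
    ANALYZE my_foreign_table;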
 
If the user wasn't able to ensure the statistics are up-to-date, I
think he/she might want to do remote ANALYZE *before* fetching the
statistics, not after trying to do so.  So I think we could instead
provide this option as such.  What do you think about that?  Anyway,
I'd vote for leaving this for another patch, as I think it's a
nice-to-have option rather than a must-have one, as I said before.

I hadn't thought of that one. I think it has some merit, but I think we'd want the try-after case as well. So the settings would be: "off" (never remote-analyze, the default), "on" (always analyze before fetching), and "retry" (analyze only if no stats were found). This feeds into the same thinking as the default setting: make the feature do the usually-smart thing, and do it automatically. Building it in pieces might be easier to get committed, but it takes away the dream of seamless automatic improvement, and it establishes defaults that can't be changed in future versions. That dream was probably unrealistic, but I wanted to try.
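To make the shape concrete, here is a sketch of how that three-valued knob might be set (the option name "remote_analyze" and its values are assumptions for illustration, not taken from any patch):

    -- 'off' (default): never issue a remote ANALYZE
    -- 'on':    always ANALYZE remotely before fetching stats
    -- 'retry': ANALYZE remotely only if the first fetch found no stats
    ALTER SERVER remote_pg
        OPTIONS (ADD remote_analyze 'retry');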
 

ISTM that the code is well organized overall.  Here are a few comments:

Thanks!
 

We don't use relallvisible and relallfrozen for foreign tables (note
that do_analyze_rel() calls vac_update_relstats() with relallvisible=0
and relallfrozen=0 for them).  Do we really need to retrieve (and
restore) them?

No, and as you stated, we wouldn't want to. The query was lifted verbatim from pg_dump, with a vague hope of moving the queries into a common library that both pg_dump and postgres_fdw could draw upon. That no longer makes sense, so I'll fix it.
 
+static const char *attstats_query_17 =
+   "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
+   "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
+   "s.histogram_bounds, s.correlation, s.most_common_elems, "
+   "s.most_common_elem_freqs, s.elem_count_histogram, "
+   "s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram "
+   "FROM pg_catalog.pg_stats AS s "
+   "WHERE s.schemaname = $1 AND s.tablename = $2 "
+   "ORDER BY s.attname, s.inherited DESC";

I think we should retrieve the attribute statistics for only the
referenced columns of the remote table, not all the columns of it, to
reduce the data transfer and the cost of matching local/remote
attributes in import_fetched_statistics().

I thought about this, and decided that we wanted to 1) avoid per-column round trips and 2) keep the remote queries simple. That said, it's not such a big deal to add "AND s.attname = ANY($3)" and construct an '{att1,att2,"ATt3"}' array literal, as we already do in a few places in pg_dump.
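For concreteness, the filtered variant is just the quoted query plus the extra predicate (a sketch; parameter numbering assumed):

    SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width,
        s.n_distinct, s.most_common_vals, s.most_common_freqs,
        s.histogram_bounds, s.correlation, s.most_common_elems,
        s.most_common_elem_freqs, s.elem_count_histogram,
        s.range_length_histogram, s.range_empty_frac, s.range_bounds_histogram
    FROM pg_catalog.pg_stats AS s
    WHERE s.schemaname = $1 AND s.tablename = $2
        AND s.attname = ANY($3)  -- $3 bound to a text[] such as '{att1,att2,"ATt3"}'
    ORDER BY s.attname, s.inherited DESC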
 

In fetch_remote_statistics()

+   if (server_version_num >= 180000)
+       relation_sql = relstats_query_18;
+   else if (server_version_num >= 140000)
+       relation_sql = relstats_query_14;
+   else
+       relation_sql = relstats_query_default;

I think that having the definition for each of relstats_query_18,
relstats_query_14 and relstats_query_default makes the code redundant
and the maintenance hard.  To avoid that, how about building the
relation_sql query dynamically as done for the query to fetch all
table data from the remote server in postgresImportForeignSchema()?
Same for the attribute_sql query.

It may be a moot point. If we're not fetching relallvisible and relallfrozen, then the 14 and 18 cases are now the same. The pre-14 case only concerns differentiating between analyzed and unanalyzed tables; we would map that to 0 IF we kept those stats, but we almost never would, because an unanalyzed remote table would not have the attribute stats necessary to qualify as a good remote fetch. So we're down to just one static query.
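Roughly, the one surviving query would be along these lines (a sketch with relallvisible and relallfrozen dropped per the earlier point; the exact column list is assumed):

    -- single static relation-stats query; no version branching needed
    SELECT c.relpages, c.reltuples
    FROM pg_catalog.pg_class AS c
    JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace
    WHERE n.nspname = $1 AND c.relname = $2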
 

That's all I have for now.  I will continue to review the changes.

Much appreciated.
