Re: Import Statistics in postgres_fdw before resorting to sampling. - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Import Statistics in postgres_fdw before resorting to sampling.
Date
Msg-id CADkLM=dSBfGT-J5xmiwkMg6FU9CByKWA1XVwC2P+-_-pZ_3OsA@mail.gmail.com
Whole thread Raw
In response to Re: Import Statistics in postgres_fdw before resorting to sampling.  (Etsuro Fujita <etsuro.fujita@gmail.com>)
Responses Re: Import Statistics in postgres_fdw before resorting to sampling.
List pgsql-hackers

I think that the FDW API that I proposed could actually allow us to
fall back to sampling, by modifying StatisticsAreImportable so that it
also checks if 1) there are statistics on the remote server and 2) the
data is fresh enough, and if so, returns true; otherwise, returns
false; in the latter case we could fall back to sampling.  And if we
modified it as such, I think we could change the default to true.
(Checking #2 is necessary to avoid importing stale data, which would
degrade plan quality.)

So while I haven't checked for "freshness" of the statistics, I have added checks that ensure that every asked-for column in the local table with attstattarget != 0 will be send in the column filter, and we:

1. Find remote stats for all the columns that made our list
2. Do not get any stats over the wire with no matching target column.
3. Sort the list of expected remote column names, which means the list matching is effectively a merge, so O(N) vs O(N^2). This is done with a name+attnum structure, but it could just as easily have been done with a local_name+remote_name, as pg_restore_attribute_stats() will take either attnum or attname as a parameter.

Aside from a pre-emptive ANALYZE, how would you propose we check for and/or measure "freshness" of the remote statistics?
 
Remote ANALYZE would be an interesting idea, but to get the automatic
improvement, I think we should first work on the issue I mentioned
above.  So I still think we should leave this for future work.

From a different perspective, even without the automatic improvement
including remote ANALYZE, I think this feature is useful for many
users.

I'm still hoping to hear from Nathan on this subject.
 
>> I think we should retrieve the attribute statistics for only the
>> referenced columns of the remote table, not all the columns of it, to
>> reduce the data transfer and the cost of matching local/remote
>> attributes in import_fetched_statistics().

> I thought about this, and decided that we wanted to 1) avoid per-column round trips and 2) keep the remote queries simple. It's not such a big deal to add "AND s.attname = ANY($3)" and construct a '{att1,att2,"ATt3"}' string, as we already do in pg_dump in a few places.

Yeah, I was also thinking of modifying the query as you proposed.

That is done in the patch attached.
 
> It may be a moot point. If we're not fetching relallfrozen, then the 14 & 18 cases are now the same, and since the pre-14 case concerns differentiating between analyzed and unanalyzed tables, we would just map that to 0 IF we kept those stats, but we almost never would because an unanalyzed remote table would not have the attribute stats necessary to qualify as a good remote fetch. So we're down to just one static query.

You are right.  As the relation_sql query is only used in
fetch_remote_statistics(), shouldn't the query be defined within that
function?

Oddly enough, I already moved it inside of fetch_remote_statistics() in the newest patch.

I wasn't planning on posting this patch until we had heard back from Nathan, but since I'd already been working on a few of the items you mentioned in your last email, I thought I'd show you that work in progress. Some issues like the documentation haven't been updated, so it's more of a work in progress, but it does pass the tests.

Summary of key points of this  v7 WIP:

* Reduced columns on relstats query, query moved inside calling function.
* Per-column filters on attstats queries, filtering on destination attstattarget != 0.
* Verification that all filtered-for columns have stats, and that all stats in the result set have a matching target column.
* Expected column list is now sorted by remote-side attname, allowing a merge join of the two lists.
* No changes to the documentation, but I know they are needed.
Attachment

pgsql-hackers by date:

Previous
From: Kirill Reshke
Date:
Subject: Re: Define DatumGetInt8 function.
Next
From: Aleksander Alekseev
Date:
Subject: Re: Define DatumGetInt8 function.