estimation problems for DISTINCT ON with FDW - Mailing list pgsql-hackers

From Jeff Janes
Subject estimation problems for DISTINCT ON with FDW
Date
Msg-id CAMkU=1xNo9cnan+Npxgz0eK7394xmjmKg-QEm8wYG9P5-CcaqQ@mail.gmail.com
Whole thread Raw
Responses Re: estimation problems for DISTINCT ON with FDW
Re: estimation problems for DISTINCT ON with FDW
List pgsql-hackers
If I use the attached sql file to set up the database with loop-back postgres_fdw, and then turn on use_remote_estimate for this query:

distinct on (id) id, z from fgn.priority order by id, priority desc,z

It issues two queries for the foreign estimate, one with a sort and one without:

EXPLAIN SELECT id, priority, z FROM public.priority

EXPLAIN SELECT id, priority, z FROM public.priority ORDER BY id ASC NULLS LAST, priority DESC NULLS FIRST, z ASC NULLS LAST

It doesn't cost out the plan of pushing the DISTINCT ON down to the foreign side, which is probably the best way to run the query.  I guess it makes sense that FDW machinery in general doesn't want to try to push a PostgreSQL specific construct.

But much worse than that, it horribly misestmates the number of unique rows it will get back, having never asked the remote side for an estimate of that.

 Result  (cost=100.51..88635.90 rows=1 width=16)
   ->  Unique  (cost=100.51..88635.90 rows=1 width=16)
         ->  Foreign Scan on priority  (cost=100.51..86135.90 rows=1000000 width=16)

Where does it come up with the idea that these 1,000,000 rows will DISTINCT/Unique down to just 1 row?   I can't find the place in the code where that happens.  I suspect it is happening somewhere in the core code based on data fed into it by postgres_fdw, not in postgres_fdw itself.

This leads to horrible plans if the DISTINCT ON is actually in a subquery which is joined to other tables, for example.

If you don't use the remote estimates, it at least comes up with a roughly sane estimate of 200 distinct rows, which is enough to inhibit selection of the worst plans. Why does an uninformative remote estimate do so much worse than no remote estimate at all?

Of course I could just disable remote estimates for this table, but then other queries that use the table without DISTINCT ON suffer.  Another solution is to ANALYZE the foreign table, but that opens up a can of worms of its own.

I see this behavior in all supported or in-development versions.

Cheers,

Jeff
Attachment

pgsql-hackers by date:

Previous
From: torikoshia
Date:
Subject: Re: Creating a function for exposing memory usage of backend process
Next
From: Amit Kapila
Date:
Subject: Re: pgsql: Enable Unix-domain sockets support on Windows