Thread: [GENERAL] Equivalence Classes when using IN
I have this query where I think it's strange that the join doesn't pull the where condition in since RHS is equal to LHS. It might be easier to expain with an example
Setup
CREATE TABLE customer (
customer_id INTEGER PRIMARY KEY
);
CREATE TABLE product (
product_id INTEGER PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
);
INSERT INTO customer (SELECT generate_series FROM generate_series(0, 1000000));
INSERT INTO product (product_id, customer_id) (SELECT generate_series, generate_series / 2 FROM generate_series(0, 2000));
Query
EXPLAIN ANALYSE
SELECT *
FROM customer c
JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p
ON c.customer_id = p.customer_id
WHERE c.customer_id IN (500, 501);
----------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=172.43..186.25 rows=1 width=12) (actual time=1.350..1.353 rows=2 loops=1)
Merge Cond: (c.customer_id = product.customer_id)
-> Sort (cost=13.93..13.93 rows=2 width=4) (actual time=0.036..0.036 rows=2 loops=1)
Sort Key: c.customer_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on customer c (cost=8.58..13.92 rows=2 width=4) (actual time=0.026..0.027 rows=2 loops=1)
Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
Heap Blocks: exact=1
-> Bitmap Index Scan on customer_pkey (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1)
Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
-> Unique (cost=158.51..169.81 rows=200 width=8) (actual time=0.783..1.221 rows=502 loops=1)
-> Sort (cost=158.51..164.16 rows=2260 width=8) (actual time=0.782..0.929 rows=1003 loops=1)
Sort Key: product.customer_id, product.product_id
Sort Method: quicksort Memory: 142kB
-> Seq Scan on product (cost=0.00..32.60 rows=2260 width=8) (actual time=0.015..0.366 rows=2001 loops=1)
Planning time: 0.281 ms
Execution time: 1.432 ms
I would expect that since c.customer_id = p.customer_id then p.customer_id IN (500, 501). If I apply this rule myself, I get a much nicer plan (and it could be even better with an index on product_id).
----------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=52.70..53.11 rows=1 width=12) (actual time=0.686..0.693 rows=2 loops=1)
Merge Cond: (product.customer_id = c.customer_id)
-> Unique (cost=38.77..38.89 rows=22 width=8) (actual time=0.647..0.651 rows=2 loops=1)
-> Sort (cost=38.77..38.83 rows=23 width=8) (actual time=0.646..0.647 rows=4 loops=1)
Sort Key: product.customer_id, product.product_id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on product (cost=0.00..38.25 rows=23 width=8) (actual time=0.331..0.632 rows=4 loops=1)
Filter: (customer_id = ANY ('{500,501}'::integer[]))
Rows Removed by Filter: 1997
-> Sort (cost=13.93..13.93 rows=2 width=4) (actual time=0.033..0.033 rows=2 loops=1)
Sort Key: c.customer_id
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on customer c (cost=8.58..13.92 rows=2 width=4) (actual time=0.025..0.026 rows=2 loops=1)
Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
Heap Blocks: exact=1
-> Bitmap Index Scan on customer_pkey (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1)
Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
Planning time: 0.386 ms
Execution time: 0.774 ms
(19 rows)
Are there any interests in adding such rule?
My idea is to wrap this in a view
SELECT c.customer_id,
p.product_id
FROM customer c
LEFT JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p
ON c.customer_id = p.customer_id
Where the LEFT JOIN can be pruned if there is no explicit need for product_id. Here I loose the power to express that both c.customer_id and p.customer_id is the same.
Best regards
Kim Carlsen
On 9 October 2017 at 08:01, Kim Rose Carlsen <krc@hiper.dk> wrote: > Is this because postgres never consider IN clause when building equivalence > class's? Only btree equality operators are considered at the moment. > Are there any interests in adding such rule? There's been some discussion on it previously, although there is lots to still be worked out, for example, it's not that clear if it will always be a win to always apply the qual. There are more details of the discussion in [1], although there's probably lots more threads to be found if you search the archives. [1] https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A@mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
David Rowley <david.rowley@2ndquadrant.com> writes: > There are more details of the discussion in [1], although there's > probably lots more threads to be found if you search the archives. > [1] https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A@mail.gmail.com That thread seems to be about transitively applying inequalities (ie, given "a = b and a < c", deduce "b < c"), which seems like a bit of a different animal than IN. Some of the issues are similar perhaps, but I'd think that being able to deduce "b IN somelist" from "a = b and a IN somelist" is more valuable, simply because the IN would typically be a much stronger constraint than an inequality. So that idea suggests that it's more worth expending planner cycles to chase the possibility. I do vaguely recall discussions specifically around IN, though I didn't have any luck finding one in the archives. There's also the recent thread https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com which suggests being able to simplify "a IN somelist AND a IN someotherlist". If we wanted to do that, making the "lists" be treatable as eclass members would be a good place to start, because that would naturally result in intersect-able lists ending up in the same eclass. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
>On 9 October 2017 at 08:01, Kim Rose Carlsen <krc@hiper.dk> wrote: >> Is this because postgres never consider IN clause when building equivalence >> class's? > >Only btree equality operators are considered at the moment. After good night sleep and reading the previous discussion, I am no longer sure I have reduced my original problem to theright example. If we continue from previous setup and add the following: ALTER TABLE customer ADD COLUMN age INTEGER; UPDATE customer SET age = customer_id / 5; CREATE INDEX ON customer (age); CREATE INDEX ON product (customer_id); SET enable_hashjoin = false; SET enable_mergejoin = false; EXPLAIN ANALYZE SELECT * FROM customer JOIN view_customer ON customer.customer_id = view_customer.customer_id WHERE age < 20; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------Nested LoopLeft Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) Join Filter: (c.customer_id= product.customer_id) Rows Removed by Join Filter: 199900 -> Nested Loop (cost=0.28..199.21 rows=334 width=12)(actual time=0.075..1.146 rows=100 loops=1) -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8)(actual time=0.067..0.282 rows=100 loops=1) Filter: (age < 20) Rows Removed by Filter:901 -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006rows=1 loops=100) Index Cond: (customer_id = customer.customer_id) Heap Fetches:100 -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) Sort Key:product.customer_id, product.product_id Sort Method: quicksort Memory: 142kB -> Seq Scanon product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1)Planning time: 0.214 msExecutiontime: 35.284 ms The planner prefer to use hash and merge joins which is ok, when many rows are to be joined, I don't think any conditioncan be merged to make these case faster. I have disabled merge and hash joins to get to a nested loop join instead,in this case it would be much better if customer_id can be pulled inside the loop, so it can look at only the relevantrows and not all rows for each loop. I somehow inferred that this would be the same as selecting from the view usingIN clause, now I'm not so sure anymore. I can see there is a trade off between planner time and how exotic the case is. If you want to be able to hide abstractionthrough views I guess the nature becomes more OLAP oriented than OLTP. Best Regards Kim Carlsen -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 9 October 2017 at 22:39, Kim Rose Carlsen <krc@hiper.dk> wrote: > EXPLAIN ANALYZE > SELECT * > FROM customer > JOIN view_customer > ON customer.customer_id = view_customer.customer_id > WHERE age < 20; > > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------- > Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) > Join Filter: (c.customer_id = product.customer_id) > Rows Removed by Join Filter: 199900 > -> Nested Loop (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1) > -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1) > Filter: (age < 20) > Rows Removed by Filter: 901 > -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006rows=1 loops=100) > Index Cond: (customer_id = customer.customer_id) > Heap Fetches: 100 > -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) > -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) > Sort Key: product.customer_id, product.product_id > Sort Method: quicksort Memory: 142kB > -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) > Planning time: 0.214 ms > Execution time: 35.284 ms You would benefit from adding the age column to view_customer, or at least consider having some view which contains all the columns you'll ever need from those tables and if you need special views with only a subset of columns due to some software doing "select * from viewname;", then you could just create some. Joining to the same table again seems like a bit of a waste of effort for the planner and executor. I'd assume customer_id is the PRIMARY KEY of customer and is unique. It's not all that clear what your view is doing here. Confusingly there's a Sort in the plan, yet nothing in the query asked for that, so I guess that the view must have an ORDER BY. If you get rid of that the planner would likely use an index on product (customer_id) to parameterise the nested loop, at least, it likely would, if you have one. It's pretty bad practice to have ORDER BY in views. I kinda wish we didn't even allow it, but that ship sailed many years ago... -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
David Rowley <david.rowley@2ndquadrant.com> writes: > It's not all that clear what your view is doing here. Confusingly > there's a Sort in the plan, yet nothing in the query asked for that, > so I guess that the view must have an ORDER BY. If you get rid of that > the planner would likely use an index on product (customer_id) to > parameterise the nested loop, at least, it likely would, if you have > one. Yeah. The ORDER BY creates a partial optimization fence, preventing any such plan from being considered. > It's pretty bad practice to have ORDER BY in views. I kinda wish we > didn't even allow it, but that ship sailed many years ago... I think it's actually disallowed by the SQL spec (although so are many other things we support). IMO it's a useful facility to have for views that are meant for direct presentation to clients --- but if you'd like joins to the view to be optimized, you don't want an ORDER BY in there. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 10 October 2017 at 02:51, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> It's pretty bad practice to have ORDER BY in views. I kinda wish we >> didn't even allow it, but that ship sailed many years ago... > > I think it's actually disallowed by the SQL spec (although so are > many other things we support). IMO it's a useful facility to have > for views that are meant for direct presentation to clients --- > but if you'd like joins to the view to be optimized, you don't > want an ORDER BY in there. If the only reason that is_simple_subquery() rejects subqueries with ORDER BY is due to wanting to keep the order by of a view, then couldn't we make is_simple_subquery() a bit smarter and have it check if the subquery is going to be joined to something else, which likely would destroy the order, or at least it would remove any guarantees of it. Something like the attached? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
Attachment
David Rowley <david.rowley@2ndquadrant.com> writes: > If the only reason that is_simple_subquery() rejects subqueries with > ORDER BY is due to wanting to keep the order by of a view, then > couldn't we make is_simple_subquery() a bit smarter and have it check > if the subquery is going to be joined to something else, which likely > would destroy the order, or at least it would remove any guarantees of > it. I'm not on board with this. The assumption is that if the user put an ORDER BY there, that means they want that subquery to be computed in that order. It's not for us to decide they didn't mean what they said. Moreover, there are cases where the ORDER BY would be semantically significant, eg if there's a LIMIT or volatile functions or tSRFs involved. BTW, I noticed that I was wrong upthread about ORDER BY in subqueries being disallowed by spec --- that was true originally, but they allow it as of SQL:2008 or thereabouts. It might be interesting to see if the spec says anything concrete about the semantics of that. regards, tom lane -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 10 October 2017 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: >> If the only reason that is_simple_subquery() rejects subqueries with >> ORDER BY is due to wanting to keep the order by of a view, then >> couldn't we make is_simple_subquery() a bit smarter and have it check >> if the subquery is going to be joined to something else, which likely >> would destroy the order, or at least it would remove any guarantees of >> it. > > I'm not on board with this. The assumption is that if the user put an > ORDER BY there, that means they want that subquery to be computed in that > order. It's not for us to decide they didn't mean what they said. > > Moreover, there are cases where the ORDER BY would be semantically > significant, eg if there's a LIMIT or volatile functions or tSRFs > involved. Ok, thanks for looking, although, FWIW, LIMIT and tSRFs are still disabled. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> least consider having some view which contains all the columns you'll
> ever need from those tables and if you need special views with only a
> subset of columns due to some software doing "select * from
> viewname;", then you could just create some. Joining to the same table
> again seems like a bit of a waste of effort for the planner and
> executor.
I would argue that the anti pattern would be the software that
insist on using "select * from viewname;" from a view that has
calculated columns that you do not care for. I recommend
introducing both lightweight views and heavyweight views, so you
can join up probably for what you need.
create more confusion than clarity in my example. My point was
only to see if anything could be added to the fabricated
execution path. I agree that the listed example does not make
sense. So I will try and give some more context to real use
cases.
base data and a relation for invoice_line. The invoice has some
invoice_id, customer_id, due_date, paid_date and invoice_line
contains each line with a invoice_id, display_name, amount. A
view (view_invoice_with_amount) where you calculate the total.
so a query could be
SELECT c.customer_id,
i.invoice_amount_total
FROM view_customer c
JOIN view_invoice_with_amount i
ON c.customer_id = i.customer_id
WHERE c.first_name = 'John';
If you ever need to filter by invoice_amount_total, it might be
necesary denormalize the relations and cache the amount in the
invoice table.
> is unique.
This is a continuation of the previous example, maybe I should have
included it all to make it more clear. But customer_id is declared
as a primary key.
> there's a Sort in the plan, yet nothing in the query asked for that,
> so I guess that the view must have an ORDER BY. If you get rid of that
> the planner would likely use an index on product (customer_id) to
> parameterise the nested loop, at least, it likely would, if you have
> one.
The view is defined in the original post. What I was trying to illustrate
was a DISTINCT ON clause to prioritize multiple products pr customer
to a somewhat "main" product for the customer. The ORDER BY on product_id
would in this case then map the first product a customer gets to its
"main" product. It could also be the most valuable product or newest ordered
active product etc. It is just some way of mapping one to many relation to a
one to one. Again the example is simplified and fabricated and maybe looses
its power to explain its intents.
> It's pretty bad practice to have ORDER BY in views. I kinda wish we
> didn't even allow it, but that ship sailed many years ago...
It is required by DISTINCT ON and as soon as you go into
reporting, datawarehouse then it gets difficult to avoid these
along with group by. Instead of writing each query from the
ground up you get a huge benefit by factorizing each query into
meaningful entities that can stand alone and make sense by
themself, and from these build up the query to answer your
questions. That way you gain lots of re-use of code and
definition doesn't change between queries. The down side is it
leaves alot of work to the planner. It's a trade off between
optimization, readability and simplicity.
> ORDER BY is due to wanting to keep the order by of a view, then
> couldn't we make is_simple_subquery() a bit smarter and have it check
> if the subquery is going to be joined to something else, which likely
> would destroy the order, or at least it would remove any guarantees of
> it.
>
> Something like the attached?
I dont know if it makes any difference that the ORDER BY is used in a
DISTINCT ON clause. In this case the ORDER BY is important.
- Kim
On Mon, Oct 09, 2017 at 07:44:50PM -0400, Tom Lane wrote: > David Rowley <david.rowley@2ndquadrant.com> writes: > > If the only reason that is_simple_subquery() rejects subqueries with > > ORDER BY is due to wanting to keep the order by of a view, then > > couldn't we make is_simple_subquery() a bit smarter and have it check > > if the subquery is going to be joined to something else, which likely > > would destroy the order, or at least it would remove any guarantees of > > it. > > I'm not on board with this. The assumption is that if the user put an > ORDER BY there, that means they want that subquery to be computed in that > order. It's not for us to decide they didn't mean what they said. > > Moreover, there are cases where the ORDER BY would be semantically > significant, eg if there's a LIMIT or volatile functions or tSRFs > involved. Or where the order is meaningful to an aggregate function applied to columns of a view result set. I'm not sure what the full set of cases where the ORDER BY on the inner query is meaningful, but I'm sure there are cases it is not. If there are no such constraints on dropping the ORDER BY, then the it could be dropped, making the view query simpler. Nico -- -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> Yeah. The ORDER BY creates a partial optimization fence, preventing > any such plan from being considered. >> I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY. A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 12 October 2017 at 08:37, Kim Rose Carlsen <krc@hiper.dk> wrote: > >> Yeah. The ORDER BY creates a partial optimization fence, preventing >> any such plan from being considered. >>> > > I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY. > > A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON. Yeah, we do allow predicates to be pushed down in that case. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
> On 11 Oct 2017, at 21.46, David Rowley <david.rowley@2ndquadrant.com> wrote: > >> On 12 October 2017 at 08:37, Kim Rose Carlsen <krc@hiper.dk> wrote: >> >>> Yeah. The ORDER BY creates a partial optimization fence, preventing >>> any such plan from being considered. >>>> >> >> I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY. >> >> A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON. > > Yeah, we do allow predicates to be pushed down in that case. > Let's ignore that it's not a very useful query I have written. Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole table oneach loop. (See original post and follow up for definitions) QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) Join Filter:(c.customer_id = product.customer_id) Rows Removed by Join Filter: 199900 -> Nested Loop (cost=0.28..199.21 rows=334width=12) (actual time=0.075..1.146 rows=100 loops=1) -> Seq Scan on customer (cost=0.00..21.51 rows=334width=8) (actual time=0.067..0.282 rows=100 loops=1) Filter: (age < 20) Rows Removed by Filter:901 -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006rows=1 loops=100) Index Cond: (customer_id = customer.customer_id) Heap Fetches:100 -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) Sort Key: product.customer_id,product.product_id Sort Method: quicksort Memory: 142kB -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) Planning time: 0.214 ms Execution time: 35.284 ms -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On 12 October 2017 at 10:15, Kim Rose Carlsen <krc@hiper.dk> wrote: > Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole tableon each loop. > > (See original post and follow up for definitions) > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------- > Nested Loop Left Join (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) > Join Filter: (c.customer_id = product.customer_id) > Rows Removed by Join Filter: 199900 > -> Nested Loop (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1) > -> Seq Scan on customer (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1) > Filter: (age < 20) > Rows Removed by Filter: 901 > -> Index Only Scan using customer_pkey on customer c (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006rows=1 loops=100) > Index Cond: (customer_id = customer.customer_id) > Heap Fetches: 100 > -> Materialize (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100) > -> Sort (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1) > Sort Key: product.customer_id, product.product_id > Sort Method: quicksort Memory: 142kB > -> Seq Scan on product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) > Planning time: 0.214 ms > Execution time: 35.284 ms I don't really see any blockers that would mean we couldn't support this, it's just that we don't currently support it. The predicates that we do pushdown are just ones we deem as safe to pushdown of the ones that appear in the query, or ones that can be derived through equivalence. (e.g. ab.a = ab.b and ab.b = 1 --> ab.a = 1) For example, consider the difference between the following: create table ab(a int, b int); insert into ab select x,x from generate_series(1,1000000)x; create index on ab(a); create index on ab(b); postgres=# explain select * from (select distinct on (a) a,b from ab order by a,b) ab where ab.b < 10; QUERY PLAN -----------------------------------------------------------------------------------Subquery Scan on ab (cost=127757.34..145257.34rows=333333 width=8) Filter: (ab.b < 10) -> Unique (cost=127757.34..132757.34 rows=1000000width=8) -> Sort (cost=127757.34..130257.34 rows=1000000 width=8) Sort Key: ab_1.a, ab_1.b -> Seq Scan on ab ab_1 (cost=0.00..14425.00 rows=1000000 width=8) (6 rows) postgres=# explain select * from (select distinct on (a) a,b from ab order by a,b) ab where ab.a < 10; QUERY PLAN -------------------------------------------------------------------------------Unique (cost=8.73..8.77 rows=9 width=8) -> Sort (cost=8.73..8.75 rows=9 width=8) Sort Key: ab.a, ab.b -> Index Scan using ab_a_idx on ab (cost=0.42..8.58rows=9 width=8) Index Cond: (a < 10) (5 rows) The "a < 10" was pushed down as we're distinct on (a), but pushing down "ab.b < 10" would be invalid and could cause wrong results. The predicate you'd like to see pushed down is actually a parameter in a parameterized Path and we don't currently generate any parameterized paths outside of each query level. Likely there's no good reason for this other than it's not been done yet, but it's really only been since 9.6 that the query planner has been flexible enough to possibly allow something like this to be done at all. The reason the planner may appear to push down the predicate when there's no DISTINCT ON clause is that the planner was able to pull the subquery (or view) up a level. When the planner is able to do this it's much more flexible to the types of plans it can generate. It's just that we don't ever pull up subqueries with DISTINCT ON, plus a bunch of other reasons. -- David Rowley http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general