Thread: [GENERAL] Equivalence Classes when using IN

[GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
Hi

I have this query where I think it's strange that the join doesn't pull the where condition in since RHS is equal to LHS. It might be easier to expain with an example

Setup
            CREATE TABLE customer (
              customer_id INTEGER PRIMARY KEY
            );

            CREATE TABLE product (
              product_id  INTEGER PRIMARY KEY,
              customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
            );
            
            INSERT INTO customer (SELECT generate_series FROM generate_series(0, 1000000));

            INSERT INTO product (product_id, customer_id) (SELECT generate_series, generate_series / 2 FROM generate_series(0, 2000));


Query

           EXPLAIN ANALYSE
            SELECT *
              FROM customer c
              JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p
                ON c.customer_id = p.customer_id
             WHERE c.customer_id IN (500, 501);

                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=172.43..186.25 rows=1 width=12) (actual time=1.350..1.353 rows=2 loops=1)
   Merge Cond: (c.customer_id = product.customer_id)
   ->  Sort  (cost=13.93..13.93 rows=2 width=4) (actual time=0.036..0.036 rows=2 loops=1)
         Sort Key: c.customer_id
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on customer c  (cost=8.58..13.92 rows=2 width=4) (actual time=0.026..0.027 rows=2 loops=1)
               Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on customer_pkey  (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1)
                     Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
   ->  Unique  (cost=158.51..169.81 rows=200 width=8) (actual time=0.783..1.221 rows=502 loops=1)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.782..0.929 rows=1003 loops=1)
               Sort Key: product.customer_id, product.product_id
               Sort Method: quicksort  Memory: 142kB
               ->  Seq Scan on product  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.015..0.366 rows=2001 loops=1)
 Planning time: 0.281 ms
 Execution time: 1.432 ms

I would expect that since c.customer_id = p.customer_id then p.customer_id IN (500, 501). If I apply this rule myself, I get a much nicer plan (and it could be even better with an index on product_id).


                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=52.70..53.11 rows=1 width=12) (actual time=0.686..0.693 rows=2 loops=1)
   Merge Cond: (product.customer_id = c.customer_id)
   ->  Unique  (cost=38.77..38.89 rows=22 width=8) (actual time=0.647..0.651 rows=2 loops=1)
         ->  Sort  (cost=38.77..38.83 rows=23 width=8) (actual time=0.646..0.647 rows=4 loops=1)
               Sort Key: product.customer_id, product.product_id
               Sort Method: quicksort  Memory: 25kB
               ->  Seq Scan on product  (cost=0.00..38.25 rows=23 width=8) (actual time=0.331..0.632 rows=4 loops=1)
                     Filter: (customer_id = ANY ('{500,501}'::integer[]))
                     Rows Removed by Filter: 1997
   ->  Sort  (cost=13.93..13.93 rows=2 width=4) (actual time=0.033..0.033 rows=2 loops=1)
         Sort Key: c.customer_id
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on customer c  (cost=8.58..13.92 rows=2 width=4) (actual time=0.025..0.026 rows=2 loops=1)
               Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on customer_pkey  (cost=0.00..8.58 rows=2 width=0) (actual time=0.018..0.018 rows=2 loops=1)
                     Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
 Planning time: 0.386 ms
 Execution time: 0.774 ms
(19 rows)

Is this because postgres never consider IN clause when building equivalence class's?

Are there any interests in adding such rule?


My idea is to wrap this in a view

          CREATE VIEW view_customer AS
            SELECT c.customer_id,
                   p.product_id
              FROM customer c
         LEFT JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY customer_id, product_id) p
                ON c.customer_id = p.customer_id

Where the LEFT JOIN can be pruned if there is no explicit need for product_id. Here I loose the power to express that both c.customer_id and p.customer_id is the same.

Best regards

Kim Carlsen

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 9 October 2017 at 08:01, Kim Rose Carlsen <krc@hiper.dk> wrote:
> Is this because postgres never consider IN clause when building equivalence
> class's?

Only btree equality operators are considered at the moment.

> Are there any interests in adding such rule?

There's been some discussion on it previously, although there is lots
to still be worked out, for example, it's not that clear if it will
always be a win to always apply the qual.

There are more details of the discussion in [1], although there's
probably lots more threads to be found if you search the archives.

[1]
https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A@mail.gmail.com

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> There are more details of the discussion in [1], although there's
> probably lots more threads to be found if you search the archives.
> [1]
https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A@mail.gmail.com

That thread seems to be about transitively applying inequalities
(ie, given "a = b and a < c", deduce "b < c"), which seems like a bit
of a different animal than IN.  Some of the issues are similar
perhaps, but I'd think that being able to deduce "b IN somelist"
from "a = b and a IN somelist" is more valuable, simply because the
IN would typically be a much stronger constraint than an inequality.
So that idea suggests that it's more worth expending planner cycles
to chase the possibility.

I do vaguely recall discussions specifically around IN, though
I didn't have any luck finding one in the archives.  There's also
the recent thread
https://www.postgresql.org/message-id/flat/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com
which suggests being able to simplify "a IN somelist AND
a IN someotherlist".  If we wanted to do that, making the
"lists" be treatable as eclass members would be a good
place to start, because that would naturally result in
intersect-able lists ending up in the same eclass.
        regards, tom lane


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
>On 9 October 2017 at 08:01, Kim Rose Carlsen <krc@hiper.dk> wrote:


>> Is this because postgres never consider IN clause when building equivalence
>> class's?
>
>Only btree equality operators are considered at the moment.

After good night sleep and reading the previous discussion, I am no longer sure I have reduced my original problem to
theright example. If we continue from previous setup and add the following: 

           ALTER TABLE customer ADD COLUMN age INTEGER;
           UPDATE customer SET age = customer_id / 5;
           CREATE INDEX ON customer (age);           CREATE INDEX ON product (customer_id);

               SET enable_hashjoin = false;
               SET enable_mergejoin = false;
                
            EXPLAIN ANALYZE
             SELECT *
               FROM customer
               JOIN view_customer
                 ON customer.customer_id = view_customer.customer_id
              WHERE age < 20;
                                                                QUERY PLAN
                    

---------------------------------------------------------------------------------------------------------------------------------------------Nested
LoopLeft Join  (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1)  Join Filter:
(c.customer_id= product.customer_id)  Rows Removed by Join Filter: 199900  ->  Nested Loop  (cost=0.28..199.21 rows=334
width=12)(actual time=0.075..1.146 rows=100 loops=1)        ->  Seq Scan on customer  (cost=0.00..21.51 rows=334
width=8)(actual time=0.067..0.282 rows=100 loops=1)              Filter: (age < 20)              Rows Removed by
Filter:901        ->  Index Only Scan using customer_pkey on customer c  (cost=0.28..0.53 rows=1 width=4) (actual
time=0.006..0.006rows=1 loops=100)              Index Cond: (customer_id = customer.customer_id)              Heap
Fetches:100  ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100)
   ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1)              Sort
Key:product.customer_id, product.product_id              Sort Method: quicksort  Memory: 142kB              ->  Seq
Scanon product  (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1)Planning time: 0.214
msExecutiontime: 35.284 ms 


The planner prefer to use hash and merge joins which is ok, when many rows are to be joined, I don't think any
conditioncan be merged to make these case faster. I have disabled merge and hash joins to get to a nested loop join
instead,in this case it would be much better if customer_id can be pulled inside the loop, so it can look at only the
relevantrows and not all rows for each loop. I somehow inferred that this would be the same as selecting from the view
usingIN clause, now I'm not so sure anymore. 

I can see there is a trade off between planner time and how exotic the case is. If you want to be able to hide
abstractionthrough views I guess the nature becomes more OLAP oriented than OLTP.  

Best Regards
Kim Carlsen

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 9 October 2017 at 22:39, Kim Rose Carlsen <krc@hiper.dk> wrote:
>             EXPLAIN ANALYZE
>              SELECT *
>                FROM customer
>                JOIN view_customer
>                  ON customer.customer_id = view_customer.customer_id
>               WHERE age < 20;
>
>                                                                  QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1)
>    Join Filter: (c.customer_id = product.customer_id)
>    Rows Removed by Join Filter: 199900
>    ->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1)
>          ->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1)
>                Filter: (age < 20)
>                Rows Removed by Filter: 901
>          ->  Index Only Scan using customer_pkey on customer c  (cost=0.28..0.53 rows=1 width=4) (actual
time=0.006..0.006rows=1 loops=100)
 
>                Index Cond: (customer_id = customer.customer_id)
>                Heap Fetches: 100
>    ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100)
>          ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1)
>                Sort Key: product.customer_id, product.product_id
>                Sort Method: quicksort  Memory: 142kB
>                ->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001
loops=1)
>  Planning time: 0.214 ms
>  Execution time: 35.284 ms

You would benefit from adding the age column to view_customer, or at
least consider having some view which contains all the columns you'll
ever need from those tables and if you need special views with only a
subset of columns due to some software doing "select * from
viewname;", then you could just create some. Joining to the same table
again seems like a bit of a waste of effort for the planner and
executor.  I'd assume customer_id is the PRIMARY KEY of customer and
is unique.

It's not all that clear what your view is doing here. Confusingly
there's a Sort in the plan, yet nothing in the query asked for that,
so I guess that the view must have an ORDER BY. If you get rid of that
the planner would likely use an index on product (customer_id) to
parameterise the nested loop, at least, it likely would, if you have
one.

It's pretty bad practice to have ORDER BY in views. I kinda wish we
didn't even allow it, but that ship sailed many years ago...

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> It's not all that clear what your view is doing here. Confusingly
> there's a Sort in the plan, yet nothing in the query asked for that,
> so I guess that the view must have an ORDER BY. If you get rid of that
> the planner would likely use an index on product (customer_id) to
> parameterise the nested loop, at least, it likely would, if you have
> one.

Yeah.  The ORDER BY creates a partial optimization fence, preventing
any such plan from being considered.

> It's pretty bad practice to have ORDER BY in views. I kinda wish we
> didn't even allow it, but that ship sailed many years ago...

I think it's actually disallowed by the SQL spec (although so are
many other things we support).  IMO it's a useful facility to have
for views that are meant for direct presentation to clients ---
but if you'd like joins to the view to be optimized, you don't
want an ORDER BY in there.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 10 October 2017 at 02:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> It's pretty bad practice to have ORDER BY in views. I kinda wish we
>> didn't even allow it, but that ship sailed many years ago...
>
> I think it's actually disallowed by the SQL spec (although so are
> many other things we support).  IMO it's a useful facility to have
> for views that are meant for direct presentation to clients ---
> but if you'd like joins to the view to be optimized, you don't
> want an ORDER BY in there.

If the only reason that is_simple_subquery() rejects subqueries with
ORDER BY is due to wanting to keep the order by of a view, then
couldn't we make is_simple_subquery() a bit smarter and have it check
if the subquery is going to be joined to something else, which likely
would destroy the order, or at least it would remove any guarantees of
it.

Something like the attached?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Attachment

Re: [GENERAL] Equivalence Classes when using IN

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> If the only reason that is_simple_subquery() rejects subqueries with
> ORDER BY is due to wanting to keep the order by of a view, then
> couldn't we make is_simple_subquery() a bit smarter and have it check
> if the subquery is going to be joined to something else, which likely
> would destroy the order, or at least it would remove any guarantees of
> it.

I'm not on board with this.  The assumption is that if the user put an
ORDER BY there, that means they want that subquery to be computed in that
order.  It's not for us to decide they didn't mean what they said.

Moreover, there are cases where the ORDER BY would be semantically
significant, eg if there's a LIMIT or volatile functions or tSRFs
involved.

BTW, I noticed that I was wrong upthread about ORDER BY in subqueries
being disallowed by spec --- that was true originally, but they allow
it as of SQL:2008 or thereabouts.  It might be interesting to see if
the spec says anything concrete about the semantics of that.
        regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 10 October 2017 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> If the only reason that is_simple_subquery() rejects subqueries with
>> ORDER BY is due to wanting to keep the order by of a view, then
>> couldn't we make is_simple_subquery() a bit smarter and have it check
>> if the subquery is going to be joined to something else, which likely
>> would destroy the order, or at least it would remove any guarantees of
>> it.
>
> I'm not on board with this.  The assumption is that if the user put an
> ORDER BY there, that means they want that subquery to be computed in that
> order.  It's not for us to decide they didn't mean what they said.
>
> Moreover, there are cases where the ORDER BY would be semantically
> significant, eg if there's a LIMIT or volatile functions or tSRFs
> involved.

Ok, thanks for looking, although, FWIW, LIMIT and tSRFs are still disabled.


-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
> You would benefit from adding the age column to view_customer, or at
> least consider having some view which contains all the columns you'll
> ever need from those tables and if you need special views with only a
> subset of columns due to some software doing "select * from
> viewname;", then you could just create some. Joining to the same table
> again seems like a bit of a waste of effort for the planner and
> executor.  

I would argue that the anti pattern would be the software that
insist on using "select * from viewname;" from a view that has
calculated columns that you do not care for. I recommend
introducing both lightweight views and heavyweight views, so you
can join up probably for what you need.

My example is fabricated trying to simplify things, but I seem to
create more confusion than clarity in my example. My point was
only to see if anything could be added to the fabricated
execution path. I agree that the listed example does not make
sense. So I will try and give some more context to real use
cases.

Imagine an invoice entity where you have one relation for invoice
base data and a relation for invoice_line. The invoice has some
invoice_id, customer_id, due_date, paid_date and invoice_line
contains each line with a invoice_id, display_name, amount. A
view (view_invoice_with_amount) where you calculate the total.

so a query could be
SELECT c.customer_id,
       i.invoice_amount_total
  FROM view_customer c
  JOIN view_invoice_with_amount i
    ON c.customer_id = i.customer_id
 WHERE c.first_name = 'John';

If you ever need to filter by invoice_amount_total, it might be
necesary denormalize the relations and cache the amount in the
invoice table.

> I'd assume customer_id is the PRIMARY KEY of customer and
> is unique.

This is a continuation of the previous example, maybe I should have
included it all to make it more clear. But customer_id is declared
as a primary key.

> It's not all that clear what your view is doing here. Confusingly
> there's a Sort in the plan, yet nothing in the query asked for that,
> so I guess that the view must have an ORDER BY. If you get rid of that
> the planner would likely use an index on product (customer_id) to
> parameterise the nested loop, at least, it likely would, if you have
> one.

The view is defined in the original post. What I was trying to illustrate
was a DISTINCT ON clause to prioritize multiple products pr customer
to a somewhat "main" product for the customer. The ORDER BY on product_id
would in this case then map the first product a customer gets to its
"main" product. It could also be the most valuable product or newest ordered
active product etc. It is just some way of mapping one to many relation to a
one to one. Again the example is simplified and fabricated and maybe looses
its power to explain its intents.

> It's pretty bad practice to have ORDER BY in views. I kinda wish we
> didn't even allow it, but that ship sailed many years ago...

It is required by DISTINCT ON and as soon as you go into
reporting, datawarehouse then it gets difficult to avoid these
along with group by. Instead of writing each query from the
ground up you get a huge benefit by factorizing each query into
meaningful entities that can stand alone and make sense by
themself, and from these build up the query to answer your
questions. That way you gain lots of re-use of code and
definition doesn't change between queries. The down side is it
leaves alot of work to the planner. It's a trade off between
optimization, readability and simplicity.

I hope I make more sense now.

- Kim

Re: [GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
> If the only reason that is_simple_subquery() rejects subqueries with
> ORDER BY is due to wanting to keep the order by of a view, then
> couldn't we make is_simple_subquery() a bit smarter and have it check
> if the subquery is going to be joined to something else, which likely
> would destroy the order, or at least it would remove any guarantees of
> it.
>
> Something like the attached?

I dont know if it makes any difference that the ORDER BY is used in a
DISTINCT ON clause. In this case the ORDER BY is important.

- Kim

Re: [GENERAL] Equivalence Classes when using IN

From
Nico Williams
Date:
On Mon, Oct 09, 2017 at 07:44:50PM -0400, Tom Lane wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
> > If the only reason that is_simple_subquery() rejects subqueries with
> > ORDER BY is due to wanting to keep the order by of a view, then
> > couldn't we make is_simple_subquery() a bit smarter and have it check
> > if the subquery is going to be joined to something else, which likely
> > would destroy the order, or at least it would remove any guarantees of
> > it.
> 
> I'm not on board with this.  The assumption is that if the user put an
> ORDER BY there, that means they want that subquery to be computed in that
> order.  It's not for us to decide they didn't mean what they said.
> 
> Moreover, there are cases where the ORDER BY would be semantically
> significant, eg if there's a LIMIT or volatile functions or tSRFs
> involved.

Or where the order is meaningful to an aggregate function applied to
columns of a view result set.  I'm not sure what the full set of cases
where the ORDER BY on the inner query is meaningful, but I'm sure there
are cases it is not.

If there are no such constraints on dropping the ORDER BY, then the it
could be dropped, making the view query simpler.

Nico
-- 


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
> Yeah.  The ORDER BY creates a partial optimization fence, preventing
> any such plan from being considered.
>>

I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY.

A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON.




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 12 October 2017 at 08:37, Kim Rose Carlsen <krc@hiper.dk> wrote:
>
>> Yeah.  The ORDER BY creates a partial optimization fence, preventing
>> any such plan from being considered.
>>>
>
> I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY.
>
> A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON.

Yeah, we do allow predicates to be pushed down in that case.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
Kim Rose Carlsen
Date:
> On 11 Oct 2017, at 21.46, David Rowley <david.rowley@2ndquadrant.com> wrote:
>
>> On 12 October 2017 at 08:37, Kim Rose Carlsen <krc@hiper.dk> wrote:
>>
>>> Yeah.  The ORDER BY creates a partial optimization fence, preventing
>>> any such plan from being considered.
>>>>
>>
>> I can see in the general case it semanticly means different things If you allow the WHERE to pass through ORDER BY.
>>
>> A special case can be allowed for WHERE to pass the ORDER BY if the column is part of DISTINCT ON.
>
> Yeah, we do allow predicates to be pushed down in that case.
>

Let's ignore that it's not a very useful query I have written.

Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole table
oneach loop.  

(See original post and follow up for definitions)                                                                QUERY
PLAN                                                                  

---------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1) Join
Filter:(c.customer_id = product.customer_id) Rows Removed by Join Filter: 199900 ->  Nested Loop  (cost=0.28..199.21
rows=334width=12) (actual time=0.075..1.146 rows=100 loops=1)       ->  Seq Scan on customer  (cost=0.00..21.51
rows=334width=8) (actual time=0.067..0.282 rows=100 loops=1)             Filter: (age < 20)             Rows Removed by
Filter:901       ->  Index Only Scan using customer_pkey on customer c  (cost=0.28..0.53 rows=1 width=4) (actual
time=0.006..0.006rows=1 loops=100)             Index Cond: (customer_id = customer.customer_id)             Heap
Fetches:100 ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100)
 ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1)             Sort Key:
product.customer_id,product.product_id             Sort Method: quicksort  Memory: 142kB             ->  Seq Scan on
product (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001 loops=1) 
Planning time: 0.214 ms
Execution time: 35.284 ms

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Equivalence Classes when using IN

From
David Rowley
Date:
On 12 October 2017 at 10:15, Kim Rose Carlsen <krc@hiper.dk> wrote:
> Why don't I see that predicate (customer_id) pushed into the outer nested loop so we don't have to sort the whole
tableon each loop.
 
>
> (See original post and follow up for definitions)
>                                                                 QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual time=0.528..35.120 rows=200 loops=1)
>   Join Filter: (c.customer_id = product.customer_id)
>   Rows Removed by Join Filter: 199900
>   ->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual time=0.075..1.146 rows=100 loops=1)
>         ->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) (actual time=0.067..0.282 rows=100 loops=1)
>               Filter: (age < 20)
>               Rows Removed by Filter: 901
>         ->  Index Only Scan using customer_pkey on customer c  (cost=0.28..0.53 rows=1 width=4) (actual
time=0.006..0.006rows=1 loops=100)
 
>               Index Cond: (customer_id = customer.customer_id)
>               Heap Fetches: 100
>   ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual time=0.005..0.130 rows=2001 loops=100)
>         ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual time=0.448..0.588 rows=2001 loops=1)
>               Sort Key: product.customer_id, product.product_id
>               Sort Method: quicksort  Memory: 142kB
>               ->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) (actual time=0.006..0.215 rows=2001
loops=1)
> Planning time: 0.214 ms
> Execution time: 35.284 ms

I don't really see any blockers that would mean we couldn't support
this, it's just that we don't currently support it. The predicates
that we do pushdown are just ones we deem as safe to pushdown of the
ones that appear in the query, or ones that can be derived through
equivalence. (e.g. ab.a = ab.b and ab.b = 1 --> ab.a = 1)

For example, consider the difference between the following:

create table ab(a int, b int);
insert into ab select x,x from generate_series(1,1000000)x;
create index on ab(a);
create index on ab(b);

postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.b < 10;                                   QUERY PLAN
-----------------------------------------------------------------------------------Subquery Scan on ab
(cost=127757.34..145257.34rows=333333 width=8)  Filter: (ab.b < 10)  ->  Unique  (cost=127757.34..132757.34
rows=1000000width=8)        ->  Sort  (cost=127757.34..130257.34 rows=1000000 width=8)              Sort Key: ab_1.a,
ab_1.b             ->  Seq Scan on ab ab_1  (cost=0.00..14425.00
 
rows=1000000 width=8)
(6 rows)


postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.a < 10;                                 QUERY PLAN
-------------------------------------------------------------------------------Unique  (cost=8.73..8.77 rows=9 width=8)
->  Sort  (cost=8.73..8.75 rows=9 width=8)        Sort Key: ab.a, ab.b        ->  Index Scan using ab_a_idx on ab
(cost=0.42..8.58rows=9 width=8)              Index Cond: (a < 10)
 
(5 rows)

The "a < 10" was pushed down as we're distinct on (a), but pushing
down "ab.b < 10" would be invalid and could cause wrong results.

The predicate you'd like to see pushed down is actually a parameter in
a parameterized Path and we don't currently generate any parameterized
paths outside of each query level. Likely there's no good reason for
this other than it's not been done yet, but it's really only been
since 9.6 that the query planner has been flexible enough to possibly
allow something like this to be done at all.

The reason the planner may appear to push down the predicate when
there's no DISTINCT ON clause is that the planner was able to pull the
subquery (or view) up a level. When the planner is able to do this
it's much more flexible to the types of plans it can generate. It's
just that we don't ever pull up subqueries with DISTINCT ON, plus a
bunch of other reasons.

-- David Rowley                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general