Re: Performance on update from join - Mailing list pgsql-sql

From: Tom Lane
Subject: Re: Performance on update from join
Date:
Msg-id: 29566.1021052657@sss.pgh.pa.us
In response to: Re: Performance on update from join (Jean-Luc Lachance <jllachan@nsd.ca>)
List: pgsql-sql
Jean-Luc Lachance <jllachan@nsd.ca> writes:
> Try:

> create table a (a1 int primary key, a2 int, a3 int, a4 int);
> create table b (b1 int, b2 int, b3 int, b4 int, primary key (b1, b2));
> create table d (d1 int, d2 int, d3 int, d4 int, primary key (d1, d2));
> explain update a set a4 = d.d2 from b,d
>   where a.a2 = b.b1 and a.a3 = b.b2
>     and b.b3 = d.d1 and b.b4 = d.d2
>     and a.a4 >= d.d3 and a.a4 <= d.d4;

> Which is closer to what I have.

>             +-----------------------------------
>             |                                  /\
> A1  A2  A3  A4     B1  B2  B3  B4    D1  D2  D3  D4
>     |   |          |   |   |   |     |   |
>     +--------------+   |   +---------+   |
>         |              |       |         |
>         +--------------+       +---------+

Well, I still don't see any tendency to want to form a "cartesian
product", if by that you mean an unconstrained join.  I tried these
cases:

After creating the tables as above, EXPLAIN of that update gives:
Hash Join  (cost=129.77..162.27 rows=1 width=54)
  Hash Cond: ("outer".a2 = "inner".b1)
  Join Filter: (("outer".a3 = "inner".b2) AND ("outer".a4 >= "inner".d3) AND ("outer".a4 <= "inner".d4))
  ->  Seq Scan on a  (cost=0.00..20.00 rows=1000 width=22)
  ->  Hash  (cost=129.70..129.70 rows=25 width=32)
        ->  Merge Join  (cost=69.83..129.70 rows=25 width=32)
              Merge Cond: (("outer".d1 = "inner".b3) AND ("outer".d2 = "inner".b4))
              ->  Index Scan using d_pkey on d  (cost=0.00..52.00 rows=1000 width=16)
              ->  Sort  (cost=69.83..72.33 rows=1000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20.00 rows=1000 width=16)

(I'm using current development sources here because of the nicer EXPLAIN
display, but the planner behavior hasn't changed from 7.2 AFAIR.)

This doesn't tell us very much because the default assumptions for the
table sizes are all the same, 1000 rows/10 pages.  I tried goosing it
up:

test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'a';
UPDATE 1
test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'b';
UPDATE 1
test=# update pg_class set reltuples=1000000,relpages=10000 where relname = 'd';
UPDATE 1
test=# explain ...

Merge Join  (cost=8294766.62..20924766.62 rows=69444444 width=54)
  Merge Cond: (("outer".b2 = "inner".a3) AND ("outer".b1 = "inner".a2))
  Join Filter: (("inner".a4 >= "outer".d3) AND ("inner".a4 <= "outer".d4))
  ->  Sort  (cost=8093208.78..8155708.78 rows=25000000 width=32)
        ->  Merge Join  (cost=373805.69..758805.69 rows=25000000 width=32)
              Merge Cond: (("outer".b4 = "inner".d2) AND ("outer".b3 = "inner".d1))
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20000.00 rows=1000000 width=16)
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on d  (cost=0.00..20000.00 rows=1000000 width=16)
  ->  Sort  (cost=201557.84..204057.84 rows=1000000 width=22)
        ->  Seq Scan on a  (cost=0.00..20000.00 rows=1000000 width=22)

Or if we make table a smaller than the other two:

test=# update pg_class set reltuples=100000,relpages=1000 where relname = 'a';
UPDATE 1
test=# explain ...
Merge Join  (cost=981132.44..2248632.44 rows=6944444 width=54)
  Merge Cond: (("outer".d2 = "inner".b4) AND ("outer".d1 = "inner".b3))
  Join Filter: (("inner".a4 >= "outer".d3) AND ("inner".a4 <= "outer".d4))
  ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
        ->  Seq Scan on d  (cost=0.00..20000.00 rows=1000000 width=16)
  ->  Sort  (cost=794229.60..800479.60 rows=2500000 width=38)
        ->  Merge Join  (cost=198580.89..241580.89 rows=2500000 width=38)
              Merge Cond: (("outer".a3 = "inner".b2) AND ("outer".a2 = "inner".b1))
              ->  Sort  (cost=11678.05..11928.05 rows=100000 width=22)
                    ->  Seq Scan on a  (cost=0.00..2000.00 rows=100000 width=22)
              ->  Sort  (cost=186902.84..189402.84 rows=1000000 width=16)
                    ->  Seq Scan on b  (cost=0.00..20000.00 rows=1000000 width=16)

The only thing I see here that is unhappy-making is that the planner
does not recognize that joining both columns of a 2-column primary key
guarantees a unique result (for instance, the join of a and b here
should be estimated at 100000 rows, not 2500000, since there can't be
more than one b row joined to any a row).  It would do better if it had
any ANALYZE statistics to work with, but ideally it ought to be able to
deduce that even without stats, just from the presence of a unique
index.  It will do so for single-column unique indexes but I haven't
yet figured out a reasonable (read inexpensive) way to make the
corresponding deduction for the multi-column case.
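(As a rough sketch of the statistics point, and assuming the tables actually contain representative data rather than being empty as in this toy setup, simply ANALYZE-ing them before the EXPLAIN gives the planner real per-column statistics instead of the hardwired defaults:

test=# analyze a;
test=# analyze b;
test=# analyze d;
test=# explain update a set a4 = d.d2 from b,d where ...;

With real statistics in place the join row estimates should come out considerably closer to the truth than the defaults shown above.)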

In any case, toy examples like this aren't going to prove much
about the behavior with real tables and real statistics ...
if you want constructive answers you're going to need to show
us your actual EXPLAIN ANALYZE results and the pg_stats entries
for the tables.
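(Concretely, something like the following, with 'a', 'b', 'd' standing in for your real table names, is the sort of thing that would be useful to post:

test=# explain analyze update a set a4 = d.d2 from b,d where ...;
test=# select * from pg_stats where tablename in ('a', 'b', 'd');

EXPLAIN ANALYZE shows the actual row counts and runtimes alongside the planner's estimates, and the pg_stats rows show what the planner thinks it knows about each column.)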
        regards, tom lane

