Re: Avoiding cartesian product - Mailing list pgsql-performance
From | Virag Saksena |
---|---|
Subject | Re: Avoiding cartesian product |
Date | |
Msg-id | 002c01c635e8$cfc225d0$3100000a@demo1 Whole thread Raw |
In response to | Avoiding cartesian product ("Virag Saksena" <virag@auptyma.com>) |
List | pgsql-performance |
> Dear Virag,
>
> AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
> indexes min and max, iirc).
>
> Also, I don't think you need to exactly determine the trace_id. Try this one
> (OTOH; might be wrong):
>
> select DISTINCT ON (a.trace_id, a.seq_no) -- See below
> b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and b.trace_id = a.trace_id
> and b.seq_no > a.seq_no -- Simply ">" is enough
> order by a.trace_id, a.seq_no, b.seq_no; -- DISTINCT, see below
>
> The trick is that DISTINCT takes the first one in each group (IIRC it is
> granted, at least someone told me on one of these lists :) ) so if you order
> by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
> appropriate b.seq_no values for each DISTINCT values.
>
> The idea of DISTINCTing by both columns is to make sure the planner finds
> the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
> used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)
>
> HTH,
>
> --
> G.
>
>
> On 2006.01.04. 5:12, Virag Saksena wrote:
> >
> > I have a table which stores cumulative values
> > I would like to display/chart the deltas between successive data
> > collections
> >
> > If my primary key only increments by 1, I could write a simple query
> >
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where a.trace_id = 22
> > and b.trace_id = a.trace_id
> > and b.seq_no = a.seq_no + 1
> > order by a.seq_no;
> >
> > However the difference in sequence number is variable.
> > So (in Oracle) I used to extract the next seq_no using a correlated
> > sub-query
> >
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where a.trace_id = 22
> > and (b.trace_id, b.seq_no) =
> > (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> > where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
> > order by a.seq_no;
> >
> > For every row in A, The correlated sub-query from C will execute
> > With an appropriate index, it will just descend the index Btree
> > go one row to the right and return that row (min > :value)
> > and join to table B
> >
> > SELECT STATEMENT
> > SORT ORDER BY
> > TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
> > NESTED LOOPS
> > TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS A
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 A
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
> > SORT AGGREGATE
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
> >
> > In postgreSQL A and B are doing a cartesian product
> > then C gets executed for every row in this cartesian product
> > and most of the extra rows get thrown out.
> > Is there any way to force an execution plan like above where the
> > correlated subquery runs before going to B.
> > The table is small right now, but it will grow to have millions of rows
> > QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------
> > Sort (cost=124911.81..124944.84 rows=13213 width=20) (actual
> > time=13096.754..13097.053 rows=149 loops=1)
> > Sort Key: a.seq_no
> > -> Nested Loop (cost=4.34..124007.40 rows=13213 width=20) (actual
> > time=1948.300..13096.329 rows=149 loops=1)
> > Join Filter: (subplan)
> > -> Seq Scan on jam_trace_sys b (cost=0.00..3.75 rows=175
> > width=16) (actual time=0.005..0.534 rows=175 loops=1)
> > -> Materialize (cost=4.34..5.85 rows=151 width=16) (actual
> > time=0.002..0.324 rows=150 loops=175)
> > -> Seq Scan on jam_trace_sys a (cost=0.00..4.19
> > rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
> > Filter: (trace_id = 22)
> > SubPlan
> > -> Aggregate (cost=4.67..4.67 rows=1 width=4) (actual
> > time=0.486..0.488 rows=1 loops=26250)
> > -> Seq Scan on jam_trace_sys c (cost=0.00..4.62
> > rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
> > Filter: ((trace_id = $0) AND (seq_no > $1))
> > Total runtime: 13097.557 ms
> > (13 rows)
> >
> > pglnx01=> \d jam_trace_sys
> > Table "public.jam_trace_sys"
> > Column | Type | Modifiers
> > -----------------+---------+-----------
> > trace_id | integer |
> > seq_no | integer |
> > cpu_utilization | integer |
> > gc_minor | integer |
> > gc_major | integer |
> > heap_used | integer |
> > Indexes:
> > "jam_trace_sys_n1" btree (trace_id, seq_no)
> >
> > pglnx01=> select count(*) from jam_trace_Sys ;
> > count
> > -------
> > 175
> > (1 row)
> >
> > pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
> > trace_id | count
> > ----------+-------
> > 15 | 2
> > 18 | 21
> > 22 | 150
> > 16 | 2
> > (4 rows)
>
Szűcs,
Thanks for your suggestion, I guess there is more than one way to attack the problem.
I ended up using a trick with limit to get the next row ...
select (b.gc_minor- a.gc_minor), (b.gc_major- a.gc_major)
from jam_trace_sys a join jam_trace_sys b on
(b.seq_no = (select c.seq_no from jam_trace_sys c
where c.trace_id = a.trace_id and c.seq_no > a.seq_no
order by c.trace_id, c.seq_no limit 1)
and b.trace_id = a.trace_id),
jam_tracesnap s1, jam_tracesnap s2
where s1.trace_id = a.trace_id
and s1.seq_no = a.seq_no
and s2.trace_id = b.trace_id
and s2.seq_no = b.seq_no
and a.trace_id = 22
order by a.seq_no;
from jam_trace_sys a join jam_trace_sys b on
(b.seq_no = (select c.seq_no from jam_trace_sys c
where c.trace_id = a.trace_id and c.seq_no > a.seq_no
order by c.trace_id, c.seq_no limit 1)
and b.trace_id = a.trace_id),
jam_tracesnap s1, jam_tracesnap s2
where s1.trace_id = a.trace_id
and s1.seq_no = a.seq_no
and s2.trace_id = b.trace_id
and s2.seq_no = b.seq_no
and a.trace_id = 22
order by a.seq_no;
This gave me a nice clean execution plan (there are some extra sources listed, but that is a bug in postgresql) ...
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=11.24..11.25 rows=1 width=20) (actual time=0.040..0.040 rows=0 loops=1)
Sort Key: a.seq_no
->Nested Loop (cost=0.00..11.23 rows=1 width=20) (actual time=0.028..0.028 rows=0 loops=1)
Join Filter: ("inner".seq_no = "outer".seq_no)
->Nested Loop (cost=0.00..9.20 rows=1 width=32) (actual time=0.024..0.024 rows=0 loops=1)
Join Filter: ("inner".seq_no = "outer".seq_no)
->Nested Loop (cost=0.00..7.17 rows=1 width=32) (actual time=0.020..0.020 rows=0 loops=1)
Join Filter: ("inner".seq_no = (subplan))
->Index Scan using jam_trace_sys_n1 on jam_trace_sys a (cost=0.00..3.41 rows=1 width=16) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (trace_id = 22)
->Index Scan using jam_trace_sys_n1 on jam_trace_sys b (cost=0.00..3.41 rows=1 width=16) (never executed)
Index Cond: (22 = trace_id)
SubPlan
->Limit (cost=0.00..0.33 rows=1 width=8) (never executed)
->Index Scan using jam_trace_sys_n1 on jam_trace_sys c (cost=0.00..6.36 rows=19 width=8) (never executed)
Index Cond: ((trace_id = $0) AND (seq_no > $1))
->Index Scan using jam_tracesnap_n1 on jam_tracesnap s1 (cost=0.00..2.01 rows=1 width=8) (never executed)
Index Cond: (22 = trace_id)
->Index Scan using jam_tracesnap_n1 on jam_tracesnap s2 (cost=0.00..2.01 rows=1 width=8) (never executed)
Index Cond: (22 = trace_id)
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=11.24..11.25 rows=1 width=20) (actual time=0.040..0.040 rows=0 loops=1)
Sort Key: a.seq_no
->Nested Loop (cost=0.00..11.23 rows=1 width=20) (actual time=0.028..0.028 rows=0 loops=1)
Join Filter: ("inner".seq_no = "outer".seq_no)
->Nested Loop (cost=0.00..9.20 rows=1 width=32) (actual time=0.024..0.024 rows=0 loops=1)
Join Filter: ("inner".seq_no = "outer".seq_no)
->Nested Loop (cost=0.00..7.17 rows=1 width=32) (actual time=0.020..0.020 rows=0 loops=1)
Join Filter: ("inner".seq_no = (subplan))
->Index Scan using jam_trace_sys_n1 on jam_trace_sys a (cost=0.00..3.41 rows=1 width=16) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (trace_id = 22)
->Index Scan using jam_trace_sys_n1 on jam_trace_sys b (cost=0.00..3.41 rows=1 width=16) (never executed)
Index Cond: (22 = trace_id)
SubPlan
->Limit (cost=0.00..0.33 rows=1 width=8) (never executed)
->Index Scan using jam_trace_sys_n1 on jam_trace_sys c (cost=0.00..6.36 rows=19 width=8) (never executed)
Index Cond: ((trace_id = $0) AND (seq_no > $1))
->Index Scan using jam_tracesnap_n1 on jam_tracesnap s1 (cost=0.00..2.01 rows=1 width=8) (never executed)
Index Cond: (22 = trace_id)
->Index Scan using jam_tracesnap_n1 on jam_tracesnap s2 (cost=0.00..2.01 rows=1 width=8) (never executed)
Index Cond: (22 = trace_id)
Regards,
Virag
----- Original Message -----
From: "Szűcs Gábor" <surrano@gmail.com>
Sent: Monday, January 09, 2006 7:59 AM
Subject: Re: Avoiding cartesian product
>
> AFAIK aggregates aren't indexed in postgres (at least not before 8.1, which
> indexes min and max, iirc).
>
> Also, I don't think you need to exactly determine the trace_id. Try this one
> (OTOH; might be wrong):
>
> select DISTINCT ON (a.trace_id, a.seq_no) -- See below
> b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, jam_trace_sys b
> where a.trace_id = 22
> and b.trace_id = a.trace_id
> and b.seq_no > a.seq_no -- Simply ">" is enough
> order by a.trace_id, a.seq_no, b.seq_no; -- DISTINCT, see below
>
> The trick is that DISTINCT takes the first one in each group (IIRC it is
> granted, at least someone told me on one of these lists :) ) so if you order
> by the DISTINCT attributes and then by b.seq_no, you'll get the smallest of
> appropriate b.seq_no values for each DISTINCT values.
>
> The idea of DISTINCTing by both columns is to make sure the planner finds
> the index. (lately I had a similar problem: WHERE a=1 ORDER BY b LIMIT 1
> used an index on b, instead of an (a,b) index. Using ORDER BY a,b solved it)
>
> HTH,
>
> --
> G.
>
>
> On 2006.01.04. 5:12, Virag Saksena wrote:
> >
> > I have a table which stores cumulative values
> > I would like to display/chart the deltas between successive data
> > collections
> >
> > If my primary key only increments by 1, I could write a simple query
> >
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where a.trace_id = 22
> > and b.trace_id = a.trace_id
> > and b.seq_no = a.seq_no + 1
> > order by a.seq_no;
> >
> > However the difference in sequence number is variable.
> > So (in Oracle) I used to extract the next seq_no using a correlated
> > sub-query
> >
> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where a.trace_id = 22
> > and (b.trace_id, b.seq_no) =
> > (select a.trace_id, min(c.seq_no) from jam_trace_sys c
> > where c.trace_id = a.trace_id and c.seq_no > a.seq_no)
> > order by a.seq_no;
> >
> > For every row in A, The correlated sub-query from C will execute
> > With an appropriate index, it will just descend the index Btree
> > go one row to the right and return that row (min > :value)
> > and join to table B
> >
> > SELECT STATEMENT
> > SORT ORDER BY
> > TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
> > NESTED LOOPS
> > TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS A
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 A
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
> > SORT AGGREGATE
> > INDEX RANGE SCAN JAM_TRACE_SYS_N1 C
> >
> > In postgreSQL A and B are doing a cartesian product
> > then C gets executed for every row in this cartesian product
> > and most of the extra rows get thrown out.
> > Is there any way to force an execution plan like above where the
> > correlated subquery runs before going to B.
> > The table is small right now, but it will grow to have millions of rows
> > QUERY PLAN
> > -----------------------------------------------------------------------------------------------------------------------------------
> > Sort (cost=124911.81..124944.84 rows=13213 width=20) (actual
> > time=13096.754..13097.053 rows=149 loops=1)
> > Sort Key: a.seq_no
> > -> Nested Loop (cost=4.34..124007.40 rows=13213 width=20) (actual
> > time=1948.300..13096.329 rows=149 loops=1)
> > Join Filter: (subplan)
> > -> Seq Scan on jam_trace_sys b (cost=0.00..3.75 rows=175
> > width=16) (actual time=0.005..0.534 rows=175 loops=1)
> > -> Materialize (cost=4.34..5.85 rows=151 width=16) (actual
> > time=0.002..0.324 rows=150 loops=175)
> > -> Seq Scan on jam_trace_sys a (cost=0.00..4.19
> > rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
> > Filter: (trace_id = 22)
> > SubPlan
> > -> Aggregate (cost=4.67..4.67 rows=1 width=4) (actual
> > time=0.486..0.488 rows=1 loops=26250)
> > -> Seq Scan on jam_trace_sys c (cost=0.00..4.62
> > rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
> > Filter: ((trace_id = $0) AND (seq_no > $1))
> > Total runtime: 13097.557 ms
> > (13 rows)
> >
> > pglnx01=> \d jam_trace_sys
> > Table "public.jam_trace_sys"
> > Column | Type | Modifiers
> > -----------------+---------+-----------
> > trace_id | integer |
> > seq_no | integer |
> > cpu_utilization | integer |
> > gc_minor | integer |
> > gc_major | integer |
> > heap_used | integer |
> > Indexes:
> > "jam_trace_sys_n1" btree (trace_id, seq_no)
> >
> > pglnx01=> select count(*) from jam_trace_Sys ;
> > count
> > -------
> > 175
> > (1 row)
> >
> > pglnx01=> select trace_id, count(*) from jam_trace_sys group by trace_id ;
> > trace_id | count
> > ----------+-------
> > 15 | 2
> > 18 | 21
> > 22 | 150
> > 16 | 2
> > (4 rows)
>
pgsql-performance by date: