Thread: The nested view from hell - Restricting a subquerry
I've got a legacy app with a hefty performance problem. The basic problem is stupid design. It takes 10-15 seconds of CPUtime to look up an invoice.<br /> Basically it's trying to mash up extra columns on an otherwise simple query, and thoseextra columns are subtotals. Simplified (this looks best in a fixed width font):<br /><br /><tt> SELECT max(order_view.order_id),max(order_view.invoice_id),sum(order_view.mileage)<br /> FROM (SELECT order_id,invoice_id,0 as miles FROM eg_order<br /> UNION <br /> SELECT order_id,0 , miles FROM eg_order_line)<br /> order_view GROUP BY order_view.order_id;<br /></tt><br /> A select byorder_id is fast. The problem is the application uses "select * from view where invoice_id=x", and the second part ofthe UNION returns all possible rows in the database. These get filtered out later, but at considerable performance hit.<br/><br /> Is there a way to get the "where invoice_id=x" into the subquery? "select distinct order_id from eg_orderwhere invoice_id=x" would do it.<br /> I can't redesign the view, because it all goes into an object relational mapperthat thinks it's a real table.<br /><br /> -Bryce Nesbitt<br /><br /><br /><tt>stage=# \d eg_invoice_summary_view<br/> View "public.eg_invoice_summary_view"<br /> Column | Type | Modifiers <br /> ----------------+------------------------+-----------<br /> invoice_id | integer | <br /> cso_id | integer | <br /> period_id | integer | <br /> account_id | integer | <br /> invoice_number | character varying(192)| <br /> invoice_date | date | <br /> amount | numeric | <br /> tax | bigint | <br /> invoice_style | integer | <br /> plan_name | charactervarying(128) | <br /> View definition:<br /> SELECT i.invoice_id, i.cso_id, i.period_id, i.account_id, i.invoice_number,i.invoice_date, i.amount, sum(tax.tax_amount) AS tax, i.invoice_style, i.plan_name<br /> FROM ( SELECTi.invoice_id, i.cso_id, i.period_id, i.account_id, i.invoice_number, i.invoice_date, sum(o.amount) AS amount, i.invoice_style,i.plan_name<br /> FROM eg_invoice i<br /> LEFT JOIN <b>eg_order_summary_view</b> o ON i.invoice_id= o.invoice_id<br /> GROUP BY i.invoice_id, i.cso_id, i.period_id, i.account_id, i.invoice_number, i.invoice_date,i.invoice_style, i.plan_name) i<br /> LEFT JOIN eg_invoice_tax tax ON i.invoice_id = tax.invoice_id<br/> GROUP BY i.invoice_id, i.cso_id, i.period_id, i.account_id, i.invoice_number, i.invoice_date, i.amount,i.invoice_style, i.plan_name;<br /><br /> stage=# \d eg_order_summary_view<br /> View "public.eg_order_summary_view"<br/> Column | Type | Modifiers <br /> ------------+--------------------------+-----------<br/> order_id | integer | <br /> d | "unknown" | <br /> cso_id | integer | <br /> invoice_id | integer |<br /> period_id | integer | <br /> ref_id | integer | <br /> order_type | integer | <br /> desc1 | text | <br /> desc2 | text |<br /> desc3 | text | <br /> desc4 | text | <br /> desc5 | text | <br /> desc6 | text | <br /> desc7 | text |<br /> desc8 | text | <br /> order_from | timestamp with time zone | <br /> order_to | timestampwith time zone | <br /> hours | double precision | <br /> mileage | double precision |<br /> amount | bigint | <br /> View definition:<br /> SELECT <b>order_view.order_id</b>, 'D' ASd, max(order_view.cso_id) AS cso_id, <b>max(order_view.invoice_id) AS invoice_id</b>, max(order_view.period_id) AS period_id,max(order_view.ref_id) AS ref_id, max(order_view.order_type) AS order_type, max(order_view.desc1::text) AS desc1,max(order_view.desc2::text) AS desc2, max(order_view.desc3::text) AS desc3, max(order_view.desc4::text) AS desc4, max(order_view.desc5::text)AS desc5, max(order_view.desc6::text) AS desc6, max(order_view.desc7::text) AS desc7, max(order_view.desc8::text)AS desc8, max(order_view.order_from) AS order_from, max(order_view.order_to) AS order_to, sum(order_view.hours)AS hours, sum(order_view.mileage) AS mileage, sum(order_view.amount) AS amount<br /> FROM ( SELECTeg_order.order_id, eg_order.cso_id, e<b>g_order.invoice_id</b>, eg_order.period_id, eg_order.ref_id, eg_order.order_type,eg_order.desc1, eg_order.desc2, eg_order.desc3, eg_order.desc4, eg_order.desc5, eg_order.desc6, eg_order.desc7,eg_order.desc8, eg_order.order_from, eg_order.order_to, 0 AS hours, 0 AS mileage, 0 AS amount<br /> FROM eg_order<br /> UNION <br /> ( SELECT <b>eg_order_line.order_id</b>, 0, <b>0</b>, 0, 0, 0, NULL::"unknown"AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown"AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", to_timestamp('1970-01-01'::text,'YYYY-MM-DD'::text) AS to_timestamp, to_timestamp('1970-01-01'::text, 'YYYY-MM-DD'::text)AS to_timestamp, 0 AS hours, eg_order_line.quantity AS mileage, eg_order_line.amt_value<br /> FROM eg_order_line<br /> WHERE eg_order_line.order_line_type = 20<br /> UNION <br /> SELECT<b>eg_order_line.order_id</b>, 0, <b>0,</b> 0, 0, 0, NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown"AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown" AS "unknown", NULL::"unknown"AS "unknown", NULL::"unknown" AS "unknown", to_timestamp('1970-01-01'::text, 'YYYY-MM-DD'::text) AS to_timestamp,to_timestamp('1970-01-01'::text, 'YYYY-MM-DD'::text) AS to_timestamp, eg_order_line.quantity AS hours, 0 ASmileage, eg_order_line.amt_value<br /> FROM eg_order_line<br /> WHERE eg_order_line.order_line_type<> 20)) order_view<br /> <b>GROUP BY order_view.order_id</b>;<br /><br /> stage=# explainselect * from eg_invoice_summary_view where invoice_id=5;<br /> <br /> QUERY PLAN <br /> <br /> ---------------------------------------------------------------------------------------------------------------<br/> ---------------------------------------------------------------------------------------------------------------<br/> -------------------<br/> GroupAggregate <b>(cost=485551.60..485551.80 rows=1 width=736)</b><br /> -> Sort (cost=485551.60..485551.61rows=7 width=736)<br /> Sort Key: i.invoice_id, i.cso_id, i.period_id, i.account_id, i.invoice_number,i.invoice_date, i.amoun<br /> t, i.invoice_style, i.plan_name<br /> -> Nested Loop Left Join (cost=485541.78..485551.50 rows=7 width=736)<br /> -> HashAggregate (cost=485540.75..485540.77 rows=1width=56)<br /> -> Nested Loop Left Join (cost=417895.29..485536.25 rows=200 width=56)<br/> -> Index Scan using eg_invoice_pkey on eg_invoice i (cost=0.00..3.01 rows=1width=<br /> 48)<br /> Index Cond: (invoice_id = 5)<br /> -> GroupAggregate (cost=417895.29..485529.24 rows=200 width=316)<br /> Filter: (max("?column3?") = 5)<br /> -> Unique (cost=417895.29..448632.54rows=614745 width=200)<br /> -> Sort (cost=417895.29..419432.15rows=614745 width=200)<br /> Sort Key: order_id, cso_id,invoice_id, period_id, ref_id, order_t<br /> ype, desc1, desc2, desc3, desc4, desc5, desc6, desc7, desc8, order_from,order_to, hours, mileage, amount<br /> -> Append (cost=0.00..106638.70rows=614745 width=200)<br /> -> Subquery Scan"*SELECT* 1" (cost=0.00..9621.64 rows=233<br /> 432 width=200)<br /> -> Seq Scan on eg_order (cost=0.00..7287.32 rows=233<br /> 432width=200)<br /> -> Result (cost=77951.41..97017.06 rows=381313width=16)<br /> -> Unique (cost=77951.41..97017.06rows=381313 width<br /> =16)<br /> -> Sort (cost=77951.41..78904.69 rows=381313 w<br /> idth=16)<br/> Sort Key: order_id, cso_id, invoice_id,pe<br /> riod_id, ref_id, order_type, desc1, desc2, desc3, desc4, desc5, desc6, desc7, desc8, order_from, order_to,hour<br /> s, mileage, amount<br /> -> Append (cost=0.00..25112.52 rows=3813<br /> 13 width=16)<br /> -> Subquery Scan "*SELECT* 2" (cos<br /> t=0.00..11887.06rows=146043 width=16)<br /> -> Seq Scan on eg_order_line <br /> (cost=0.00..10426.63rows=146043 width=16)<br /> Filter: (order_line_type<br /> = 20)<br/> -> Subquery Scan "*SELECT* 3" (cos<br/> t=0.00..13225.46 rows=235270 width=16)<br /> -> Seq Scan on eg_order_line <br /> (cost=0.00..10872.76rows=235270 width=16)<br /> Filter: (order_line_type<br /> <>20)<br /> -> Bitmap Heap Scan on eg_invoice_tax tax (cost=1.03..10.65 rows=7 width=8)<br /> Recheck Cond: (invoice_id = 5)<br /> -> Bitmap Index Scan on ix2f10773c8edf278d (cost=0.00..1.03 rows=7 width=0)<br /> Index Cond: (invoice_id = 5)<br /> (31rows)<br /><br /></tt><br /><br /><pre class="moz-signature" cols="100">-- ---- Visit <a class="moz-txt-link-freetext" href="http://www.obviously.com/">http://www.obviously.com/</a> </pre>
Bryce Nesbitt skrev: > I've got a legacy app with a hefty performance problem. The basic > problem is stupid design. It takes 10-15 seconds of CPU time to look up > an invoice. > Basically it's trying to mash up extra columns on an otherwise simple > query, and those extra columns are subtotals. Simplified (this looks > best in a fixed width font): > > SELECT max(order_view.order_id),max(order_view.invoice_id) > ,sum(order_view.mileage) > FROM (SELECT order_id,invoice_id, 0 as miles FROM eg_order > UNION > SELECT order_id,0 , miles FROM eg_order_line) > order_view GROUP BY order_view.order_id; > > A select by order_id is fast. The problem is the application uses > "select * from view where invoice_id=x", and the second part of the > UNION returns all possible rows in the database. These get filtered out > later, but at considerable performance hit. Just for the record, I believe your simplified example should look like this (changed "max(order_id)" to "order_id" in outer select , changed "miles" to "mileage"): SELECT order_id,max(order_view.invoice_id),sum(order_view.mileage) FROM (SELECT order_id,invoice_id, 0 as mileageFROM eg_order UNION SELECT order_id, 0, mileage FROM eg_order_line) order_view GROUPBY order_view.order_id; It is pretty clear that the problem comes from joining on the result of an aggregate. PG apparently is not smart enough to recognize that the result of a max must be one of the values of the column (meaning that it can use an index) It is not clear whether there is a FK relation between eg_order and eg_order_line and what the PK of eg_order is. If there is a FK, you can do something along the lines of SELECT order_id,invoice_idCOALESCE(sum(mileage),0) as mileage FROM eg_order LEFT JOIN eg_order_line USING order_id GROUP BY order_id, invoice_id If there can be more than one invoice_id per order_id, you might need to add HAVING invoice_id = (SELECT max(invoice_id) FROM eg_order eo2 WHERE eg_order.order_id = eo2.order_id) or similar. Hope this helps, Nis
Nis Jørgensen <nis@superlativ.dk> writes: > Bryce Nesbitt skrev: >> I've got a legacy app with a hefty performance problem. > It is not clear whether there is a FK relation between eg_order and > eg_order_line and what the PK of eg_order is. Or in English: can there really be more than one invoice_id per order_id or vice versa? AFAICS your only hope of making searches on invoice_id be fast is if you can GROUP BY both order_id and invoice_id, and get rid of both max() calls. > PG apparently is not smart enough to recognize that the > result of a max must be one of the values of the column (meaning that it > can use an index) That's because it can't. As written, the query demands sums over groups that *include* a specific invoice_id --- but each sum has to include contributions from rows that could have another invoice_id. So the condition on invoice_id cannot be pushed down to the individual scans. If, in fact, the correct answer could be had by fetching only rows with the specified invoice_id, then you need to fix the view to make that clear. BTW, wouldn't UNION ALL be better than UNION here? regards, tom lane
Great analysis Gregory & Tom... UNION ALL will make a difference. --------------------------- Here invoices consist of orders, orders consist of order lines. Thus, each order_id corresponds to just one invoice_id. One possibility is to add an invoice_id to the order_line. That way the optimizer need not push anything... the rows will get filtered out early. Gregory Stark wrote: > Two things are going wrong. > > First, does an order_id belong to precisely one invoice_id? In which case > instead of grouping just y order_id you need to group by invoice_id,order_id > and remove the MAX() from around invoice_id. The optimizer can't push the > invoice_id=? clause down inside the group by because normally to calculate > max() it needs th entire set of records. It doesn't know there will be only > one value. > > Secondly it seems to me each branch of the union generates distinct values. > That is there can't be any duplicates or overlap. In which case you can change > the UNION to a UNION ALL. > > There might be more problems but at first glance it looks like the optimizer > would be able to push the invoice_id=? clause into the subqueries once those > two changes are made which would throw away the subtotals and reduce to a > simple index lookup on invoice_id.
Tom Lane skrev: >> PG apparently is not smart enough to recognize that the >> result of a max must be one of the values of the column (meaning that it >> can use an index) > > That's because it can't. As written, the query demands sums over groups > that *include* a specific invoice_id --- but each sum has to include > contributions from rows that could have another invoice_id. So the > condition on invoice_id cannot be pushed down to the individual scans. > If, in fact, the correct answer could be had by fetching only rows with > the specified invoice_id, then you need to fix the view to make that > clear. Well, the query can be satisfied by looking only at the rows with an order_id matching the invoice_id given. The condition that this is the largest invoice in the group then needs to be checked afterwards. I certainly did not expect the query planner to be able to deduce this, though. Nis
Nis Jørgensen <nis@superlativ.dk> writes: > Well, the query can be satisfied by looking only at the rows with an > order_id matching the invoice_id given. The condition that this is the > largest invoice in the group then needs to be checked afterwards. > > I certainly did not expect the query planner to be able to deduce this, > though. No, that's not true. If you had two records in eg_order with the same order_id but different invoice_ids then the query would need both records to satisfy the query. The query planner can't deduce that this can't happen because it simply does not have that information. The more I look at this view the more I think it's just seriously broken. Why is it grouping by order_id at all if, I suspect, there will only be one record per order_id in eg_orders?? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark skrev: > Nis Jørgensen <nis@superlativ.dk> writes: > >> Well, the query can be satisfied by looking only at the rows with an >> order_id matching the invoice_id given. The condition that this is the >> largest invoice in the group then needs to be checked afterwards. >> >> I certainly did not expect the query planner to be able to deduce this, >> though. > > No, that's not true. If you had two records in eg_order with the same order_id > but different invoice_ids then the query would need both records to satisfy > the query. I assume you mean "... then both records are necessary in order to calculate the results of the query". This does not contradict what I wrote. If you mean "... then both records need to satisfy <some criteria>" then I don't understand which criteria you are talking about. The query in question was: SELECT order_id,max(order_view.invoice_id),sum(order_view.mileage) FROM (SELECT order_id,invoice_id, 0 as mileageFROM eg_order UNION SELECT order_id, 0, mileage FROM eg_order_line) order_view GROUPBY order_view.order_id; This is then restricted on max(invoice_id) As far as I can tell, these steps produce the correct results (without the later information about primary keys provided by Bryce) INPUT: my_invoice_id 1. Look up all order_ids for which (order_id,my_invoice_id) appear in eg_orders 2. Find all rows (in both branches of the UNION) with these id_s 3. Group the rows, and calculate max(invoice_id) 4. Filter the result rows on max(invoice_id) = my_invoice_id. > The query planner can't deduce that this can't happen because it simply does > not have that information. Well, I realize that the existing query planner can't. Since I could arrive at the plan above by deduction so could a hypothetical different query planner. Whether it is faster is of course unknown - my guess is that it would be in this case. > The more I look at this view the more I think it's just seriously broken. > Why is it grouping by order_id at all if, I suspect, there will only be one > record per order_id in eg_orders?? Bryce has confirmed this. The above is only of academic interest. Yours, Nis Jorgensen
Nis Jørgensen <nis@superlativ.dk> writes: >>> Well, the query can be satisfied by looking only at the rows with an >>> order_id matching the invoice_id given. The condition that this is the >>> largest invoice in the group then needs to be checked afterwards. >>> >>> I certainly did not expect the query planner to be able to deduce this, >>> though. >> >> No, that's not true. If you had two records in eg_order with the same order_id >> but different invoice_ids then the query would need both records to satisfy >> the query. > > I assume you mean "... then both records are necessary in order to > calculate the results of the query". This does not contradict what I wrote. Sorry I meant, "the query as written can not be satisfied by looking only at the rows with the specified invoice_id". > SELECT order_id, > max(order_view.invoice_id), > sum(order_view.mileage) > FROM (SELECT order_id,invoice_id, 0 as mileage FROM eg_order > UNION > SELECT order_id, 0, mileage FROM eg_order_line) > order_view GROUP BY order_view.order_id; > > This is then restricted on max(invoice_id) > > As far as I can tell, these steps produce the correct results (without > the later information about primary keys provided by Bryce) > > INPUT: my_invoice_id > > 1. Look up all order_ids for which (order_id,my_invoice_id) appear in > eg_orders > > 2. Find all rows (in both branches of the UNION) with these id_s > > 3. Group the rows, and calculate max(invoice_id) > > 4. Filter the result rows on max(invoice_id) = my_invoice_id. So here's a hypothetical data set for which this algorithm fails: order_id invoice_id mileage -------------------------------------------- 1 1 100 1 2 100 Your algorithm would produce order_id max(invoice_id) sum(mileage) -------------------------------------------- 1 1 100 Whereas the correct output would be to output no records at all. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark skrev: >> 1. Look up all order_ids for which (order_id,my_invoice_id) appear in >> eg_orders >> >> 2. Find all rows (in both branches of the UNION) with these id_s >> >> 3. Group the rows, and calculate max(invoice_id) >> >> 4. Filter the result rows on max(invoice_id) = my_invoice_id. > > So here's a hypothetical data set for which this algorithm fails: > > order_id invoice_id mileage > -------------------------------------------- > 1 1 100 > 1 2 100 > > Your algorithm would produce > > order_id max(invoice_id) sum(mileage) > -------------------------------------------- > 1 1 100 > > Whereas the correct output would be to output no records at all. (I assume you are using "1" as the parameter to the query). You seem to have interpreted one or more of the steps differently than I intended them. I have tried to clarify (resorting to SQL for the first step). 1. SELECT DISTINCT order_id FROM eg_order WHERE invoice_id = my_invoice_id 2. Find all rows (in both branches of the UNION) with these order_ids. 3. Group the rows on order_id, and calculate max(invoice_id) 4. Filter the result rows on max(invoice_id) = my_invoice_id. Thus, step 1 produces order_id ----------1 step 2 produces order_id invoice_id mileage--------------------------------------------1 1 1001 2 100 Step 3 producesorder_id max(invoice_id) sum(mileage)--------------------------------------------1 2 200 Step 4 eliminates this row, since it does not satisfy the criteria. Nis
"Gregory Stark" <stark@enterprisedb.com> writes: > Nis Jørgensen <nis@superlativ.dk> writes: > >> 1. Look up all order_ids for which (order_id,my_invoice_id) appear in >> eg_orders >> >> 2. Find all rows (in both branches of the UNION) with these id_s Oh, did you mean look up the order_ids for which there was at least one record with the invoice_id specified, then look up all records with those order_ids regardless of invoice_id? That would work but as you say it would be hard to tell whether it will be any faster than just processing all the order_ids. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
One down. Total runtime of the simplest query went from 34661.572 ms to .634 ms (45,000 times faster). stage=> explain analyze select * from eg_order_summary_view where invoice_id=1432655; QUERY PLAN -----------------------------------------------------------------------------------------------------------------------------------------------HashAggregate (cost=47.75..48.60 rows=13 width=214) (actual time=0.444..0.467 rows=9 loops=1) -> Nested Loop Left Join (cost=0.00..46.76 rows=21 width=214) (actual time=0.037..0.175 rows=14 loops=1) -> Index Scan using ix522779518edf278d on eg_order (cost=0.00..4.70 rows=13 width=200) (actual time=0.020..0.034 rows=9 loops=1) Index Cond: (invoice_id = 1432655) -> Index Scan using ixf8331222783867cc on eg_order_line (cost=0.00..3.21 rows=2 width=18) (actual time=0.007..0.010 rows=2 loops=9) Index Cond: ("outer".order_id =eg_order_line.order_id)Total runtime: 0.645 ms (7 rows) stage=> \d eg_order_summary_view; View definition:SELECT eg_order.order_id, 'D' AS d, max(eg_order.cso_id) AS cso_id, eg_order.invoice_id, max(eg_order.period_id) AS period_id, max(eg_order.ref_id) AS ref_id, max(eg_order.order_type::integer) AS order_type, max(eg_order.desc1::text) AS desc1, max(eg_order.desc2::text) AS desc2, max(eg_order.desc3::text) AS desc3, max(eg_order.desc4::text) AS desc4, max(eg_order.desc5::text) AS desc5, max(eg_order.desc6::text) AS desc6, max(eg_order.desc7::text) AS desc7, max(eg_order.desc8::text) AS desc8, max(timezone('PST8PDT'::text, eg_order.order_from)) AS order_from, max(timezone('PST8PDT'::text, eg_order.order_to)) AS order_to, sum( CASE WHEN eg_order_line.order_line_type <> 20 THEN eg_order_line.quantity ELSE 0::double precision END) AS hours, sum( CASE WHEN eg_order_line.order_line_type= 20 THEN eg_order_line.quantity ELSE 0::double precision END) AS mileage, sum(eg_order_line.amt_value) AS amount FROM eg_order LEFT JOIN eg_order_line USING (order_id) GROUP BY eg_order.order_id, eg_order.invoice_id;