Thread: Query works when kludged, but would prefer "best practice" solution
Hi all,

Please see the section marked as "PROBLEM" in the "ORIGINAL QUERY" plan
below. You can see it's pretty slow. Oddly enough, an index for
facility_address_id is available but not being used, though I suspect it's
questionable whether it would be an improvement.

I knew that the filter was best applied to the results of the join - my
attempts to restructure the query with subqueries, etc. didn't fool the
planner - it always figured out a plan with this problematic SEQ SCAN +
FILTER in it. Finally, I "hid" the condition from the planner with a
coalesce function - see "SOLUTION" in the "KLUDGED QUERY" plan below. Sure
enough, a new plan appeared with a remarkable performance improvement!

The purpose of this query is to find facilities within a geographical area
when the complete address data is missing (hence the facility_address_id is
NULL).

PG is 8.2.4 on a RH Linux server with 1GB RAM; HDD is RAID 1.

I don't like kludging like this - so any and all help or advice is
appreciated!

Carlo

ORIGINAL QUERY

select
   pp.provider_id,
   pp.provider_practice_id,
   nearby.distance
from mdx_core.provider_practice as pp
join mdx_core.facility as f
   on f.facility_id = pp.facility_id
join (select * from mdx_core.zips_in_mile_range('08820', 10)) as nearby
   on f.default_country_code = 'US'
   and f.default_postal_code = nearby.zip
where facility_address_id is null

Hash Join  (cost=30258.99..107702.53 rows=9438 width=16) (actual time=169.516..3064.188 rows=872 loops=1)
  Hash Cond: (pp.facility_id = f.facility_id)
PROBLEM:
------------
  ->  Seq Scan on provider_practice pp  (cost=0.00..74632.55 rows=724429 width=12) (actual time=0.039..1999.457 rows=728396 loops=1)
        Filter: (facility_address_id IS NULL)
------------
  ->  Hash  (cost=29954.15..29954.15 rows=24387 width=12) (actual time=156.668..156.668 rows=907 loops=1)
        ->  Nested Loop  (cost=0.00..29954.15 rows=24387 width=12) (actual time=149.891..155.343 rows=907 loops=1)
              ->  Function Scan on zips_in_mile_range  (cost=0.00..12.50 rows=1000 width=40) (actual time=149.850..149.920 rows=66 loops=1)
              ->  Index Scan using facility_country_postal_code_idx on facility f  (cost=0.00..29.64 rows=24 width=15) (actual time=0.015..0.048 rows=14 loops=66)
                    Index Cond: ((f.default_country_code = 'US'::bpchar) AND ((f.default_postal_code)::text = zips_in_mile_range.zip))
Total runtime: 3065.338 ms

KLUDGED QUERY

select
   pp.provider_id,
   pp.provider_practice_id,
   nearby.distance
from mdx_core.provider_practice as pp
join mdx_core.facility as f
   on f.facility_id = pp.facility_id
join (select * from mdx_core.zips_in_mile_range('08820', 10)) as nearby
   on f.default_country_code = 'US'
   and f.default_postal_code = nearby.zip
   and coalesce(pp.facility_address_id, -1) = -1

Nested Loop  (cost=0.00..112618.87 rows=180 width=16) (actual time=149.680..167.261 rows=872 loops=1)
  ->  Nested Loop  (cost=0.00..29954.15 rows=24387 width=12) (actual time=149.659..155.018 rows=907 loops=1)
        ->  Function Scan on zips_in_mile_range  (cost=0.00..12.50 rows=1000 width=40) (actual time=149.620..149.698 rows=66 loops=1)
        ->  Index Scan using facility_country_postal_code_idx on facility f  (cost=0.00..29.64 rows=24 width=15) (actual time=0.015..0.045 rows=14 loops=66)
              Index Cond: ((f.default_country_code = 'US'::bpchar) AND ((f.default_postal_code)::text = zips_in_mile_range.zip))
SOLUTION
-------------
  ->  Index Scan using provider_practice_facility_idx on provider_practice pp  (cost=0.00..3.38 rows=1 width=12) (actual time=0.007..0.009 rows=1 loops=907)
        Index Cond: (f.facility_id = pp.facility_id)
        Filter: (COALESCE(facility_address_id, -1) = -1)
-------------
Total runtime: 168.275 ms
On 9/17/07, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> Hi all,
>
> Please see the section marked as "PROBLEM" in "ORIGINAL QUERY" plan below.
> You can see it's pretty slow. Oddly enough, an index for facility_address_id
> is available but not being used, but I suspect it's questionable whether it
> would be an improvement.

This looks like it might be the problem Tom caught and rigged a solution to:
http://people.planetpostgresql.org/dfetter/index.php?/archives/134-PostgreSQL-Weekly-News-September-03-2007.html
(look for "band-aid").

If that's the case, the solution is to wait for 8.2.5 (coming soon).

merlin
Re: Query works when kludged, but would prefer "best practice" solution
Well, there goes my dream of getting a recommendation that will deliver a
blinding insight into how to speed up all of my queries a thousand-fold.

Thanks Merlin!

Carlo
On 9/17/07, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> Well, there goes my dream of getting a recommendation that will deliver a
> blinding insight into how to speed up all of my queries a thousand-fold.

that's easy...delete your data! :-)

merlin
Re: Query works when kludged, but would prefer "best practice" solution
Thanks, it worked. Client happy. Big bonus in the mail.
"Merlin Moncure" <mmoncure@gmail.com> writes: > On 9/17/07, Carlo Stonebanks <stonec.register@sympatico.ca> wrote: >> Please see the section marked as "PROBLEM" in "ORIGINAL QUERY" plan below. > This looks like it might be the problem tom caught and rigged a solution to: > http://people.planetpostgresql.org/dfetter/index.php?/archives/134-PostgreSQL-Weekly-News-September-03-2007.html > (look fro band-aid). No, fraid not, that was about misestimation of outer joins, and I see no outer join here. What I do see is misestimation of a set-returning-function's output: -> Function Scan on zips_in_mile_range (cost=0.00..12.50 rows=1000 width=40) (actual time=149.850..149.920 rows=66 loops=1) There's not any very nice way to improve that in existing releases :-(. In 8.3 it will be possible to add a ROWS option to function definitions to replace the default "1000 rows" estimate with some other number, but that still helps little if the number of result rows is widely variable. As far as kluges go: rather than kluging conditions affecting unrelated tables, maybe you could put in a dummy constraint on the function's output --- ie, a condition you know is always true, but the planner won't know that, and will scale down its result-rows estimate accordingly. regards, tom lane
Re: Query works when kludged, but would prefer "best practice" solution
Hi Tom,

Thanks for the suggestion - this concept is pretty new to me. Can you
expand a bit on the idea of how to place such a "dummy" constraint on a
function, and the conditions under which it affects the planner? Would this
require that constraint_exclusion be set on?

(When I go to sleep, I have a dream -- and in this dream Tom writes a
brilliant three-line code sample that makes it all clear to me, and I wake
up a PostgreSQL guru) ;-)

Carlo
"Carlo Stonebanks" <stonec.register@sympatico.ca> writes: > Thanks for the suggestion - this concept is pretty new to me. Can you expand > a bit on the idea of how to place such a "dummy" constraint on a function, > and the conditions on which it affects the planner? Let's say that you know that the function's result column "x" can only range from 1 to 1000. The planner does not know that, and has no statistics from which it could guess, so it's going to fall back on default selectivity estimates for any WHERE clause involving x. So for instance you could tack on something like FROM ... (select * from myfunc() where x <= 1000) ... which will change the actual query result not at all, but will cause the planner to reduce its estimate of the number of rows out by whatever the default selectivity estimate for an inequality is (from memory, 0.333, but try it and see). If that's too much or not enough, you could try some other clauses that will never really reject any rows, for instance where x >= 1 and x <= 1000 where x <> -1 where x is not null Of course this technique depends on knowing something that will always be true about your data, but most people can think of something... Now this is not going to affect the evaluation of the function itself at all. What it will do is affect the shape of a join plan built atop that function scan, since joins are pretty much all about minimizing the number of intermediate rows. > Would this require that > constraint_exclusion be set on? No. regards, tom lane
I think Tom is talking about something like this:

explain select * from foo();

                          QUERY PLAN
----------------------------------------------------------------------
 Function Scan on foo  (cost=0.00..12.50 rows=1000 width=50)

The planner is estimating the function will return 1000 rows.

explain select * from foo() where id > 0;

                          QUERY PLAN
---------------------------------------------------------------------
 Function Scan on foo  (cost=0.00..15.00 rows=333 width=50)
   Filter: (id > 0)

In the second case I am asking for all ids greater than zero, but my ids
are all positive integers. The planner doesn't know that, so it assumes
the where clause will decrease the number of results.

I would still say this is a kludge, and since you already found a kludge
that works, this may not help you at all.

Dave
My client "publishes" an "edition" of their DB from his production site to
his hosted web/db server. This is done by FTPing a backup of the DB to his
hosting provider. Immediately after a "publication" (restore to the web/db
server) we run VACUUM ANALYZE to make sure the statistics and row estimates
are correct.

The problem is, after this initial VACUUM ANALYZE, the row estimates in
query plans are off by several orders of magnitude. For example, a
disastrous plan was created because the planner estimated 4K rows when in
fact it returned 980K rows. Sometimes - a day or two later - the plans
return to "normal" and row estimates are closer to realistic values.
Guessing that there may be background events that are correcting the row
estimates over time, I ran an ANALYZE on the DB - and sure enough - the
row estimates corrected themselves.

The puzzling thing is, there have been no writes of any sort to the data -
there is no reason for the stats to have changed. I believe that a VACUUM
may not be necessary for a newly restored DB, but I assumed that VACUUM
ANALYZE and ANALYZE have the same net result. Am I wrong?

If I am not wrong (i.e. VACUUM ANALYZE and ANALYZE should produce the same
results), why would the performance improve on a DB that has seen no
transactional activity only after the SECOND try?

PG 8.2.4 on RH Linux, 1GB RAM, SCSI RAID 1

Carlo
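P.S. For reference, the sequence is roughly the sketch below. The pg_stats
query at the end is just my own sanity check that per-column statistics
exist (schema/table names are from our environment):

   -- immediately after the restore completes:
   VACUUM ANALYZE;

   -- a day or two later, after plans had gone bad, this alone fixed the estimates:
   ANALYZE;

   -- sanity check: do per-column statistics exist for a key table?
   select attname, n_distinct
   from pg_stats
   where schemaname = 'mdx_core' and tablename = 'provider_practice';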
I am noticing that my queries are spending a lot of time in nested loops. The table/index row estimates are not bad, but the row estimates for the nested loops can be off by a factor of 50 - and they are always too high.
If this is always occurring, is this an indication of a general configuration problem?
Carlo
select
pp.provider_id,
pp.provider_practice_id,
nearby.distance
from mdx_core.provider_practice as pp
join mdx_core.facility as f
on f.facility_id = pp.facility_id
join (select * from mdx_core.zips_in_mile_range('08820', 10) where zip > '') as nearby
on f.default_country_code = 'US'
and f.default_postal_code = nearby.zip
and pp.facility_address_id is NULL
union select
pp.provider_id,
pp.provider_practice_id,
nearby.distance
from mdx_core.provider_practice as pp
join mdx_core.facility_address as fa
on fa.facility_address_id = pp.facility_address_id
join mdx_core.address as a
on a.address_id = fa.address_id
join (select * from mdx_core.zips_in_mile_range('08820', 10) where zip > '') as nearby
on a.country_code = 'US'
and a.postal_code = nearby.zip
Unique  (cost=67605.91..67653.18 rows=4727 width=16) (actual time=8634.618..8637.918 rows=907 loops=1)
  ->  Sort  (cost=67605.91..67617.73 rows=4727 width=16) (actual time=8634.615..8635.651 rows=907 loops=1)
        Sort Key: provider_id, provider_practice_id, distance
        ->  Append  (cost=0.00..67317.41 rows=4727 width=16) (actual time=176.056..8632.429 rows=907 loops=1)
              ->  Nested Loop  (cost=0.00..38947.07 rows=3143 width=16) (actual time=176.054..7867.962 rows=872 loops=1)
                    ->  Nested Loop  (cost=0.00..11520.79 rows=8121 width=12) (actual time=169.372..3041.010 rows=907 loops=1)
                          ->  Function Scan on zips_in_mile_range  (cost=0.00..15.00 rows=333 width=40) (actual time=151.479..151.671 rows=66 loops=1)
                                Filter: (zip > ''::text)
                          ->  Index Scan using facility_country_postal_code_idx on facility f  (cost=0.00..34.25 rows=24 width=15) (actual time=4.969..43.740 rows=14 loops=66)
                                Index Cond: ((f.default_country_code = 'US'::bpchar) AND ((f.default_postal_code)::text = zips_in_mile_range.zip))
                    ->  Index Scan using provider_practice_facility_idx on provider_practice pp  (cost=0.00..3.36 rows=1 width=12) (actual time=4.915..5.316 rows=1 loops=907)
                          Index Cond: (f.facility_id = pp.facility_id)
                          Filter: (facility_address_id IS NULL)
              ->  Nested Loop  (cost=0.00..28323.07 rows=1584 width=16) (actual time=170.310..762.472 rows=35 loops=1)
                    ->  Nested Loop  (cost=0.00..7791.77 rows=1579 width=12) (actual time=170.289..612.579 rows=36 loops=1)
                          ->  Nested Loop  (cost=0.00..2595.96 rows=712 width=12) (actual time=167.017..354.261 rows=29 loops=1)
                                ->  Function Scan on zips_in_mile_range  (cost=0.00..15.00 rows=333 width=40) (actual time=150.188..150.312 rows=66 loops=1)
                                      Filter: (zip > ''::text)
                                ->  Index Scan using address_country_postal_code_address_idx on address a  (cost=0.00..7.73 rows=2 width=17) (actual time=2.483..3.086 rows=0 loops=66)
                                      Index Cond: ((a.country_code = 'US'::bpchar) AND ((a.postal_code)::text = zips_in_mile_range.zip))
                          ->  Index Scan using facility_address_address_idx on facility_address fa  (cost=0.00..7.15 rows=12 width=8) (actual time=7.652..8.901 rows=1 loops=29)
                                Index Cond: (a.address_id = fa.address_id)
                    ->  Index Scan using provider_practice_facility_address_idx on provider_practice pp  (cost=0.00..12.80 rows=16 width=12) (actual time=4.156..4.158 rows=1 loops=36)
                          Index Cond: (fa.facility_address_id = pp.facility_address_id)
Total runtime: 8639.066 ms