Re: vacuum analyze slows sql query - Mailing list pgsql-performance
From | patrick ~ |
---|---|
Subject | Re: vacuum analyze slows sql query |
Date | |
Msg-id | 20041108231951.23927.qmail@web52105.mail.yahoo.com |
In response to | Re: vacuum analyze slows sql query (John Meinel <john@johnmeinel.com>) |
Responses | Re: vacuum analyze slows sql query; Re: vacuum analyze slows sql query; Re: vacuum analyze slows sql query |
List | pgsql-performance |
Hi John,

Thanks for your reply and analysis.

--- John Meinel <john@johnmeinel.com> wrote:
> patrick ~ wrote:
> [...]
> > pkk=# explain analyze execute pkk_01( 241 );
> >                                       QUERY PLAN
> > ------------------------------------------------------------------------------------------
> >  Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213 rows=1 loops=1)
> >    InitPlan
> >      ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual time=2872.189..2872.189 rows=0 loops=1)
> >            ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983 width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
> >                  Filter: ((offer_id = $1) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
> >  Total runtime: 2872.339 ms
> > (6 rows)
> >
> > Time: 2873.479 ms
> >
> > [...]
> >
> > So, is this the ultimate solution to this issue?
> >
> > --patrick
>
> It's not so much that correlation is < 0.5. It sounds like you're
> running into the same issue that I ran into in the past. You have a
> column with lots of repeated values, and a few exceptional ones.
> Notice this part of the query:
>
>   ->  Seq Scan on pkk_purchase p0  (cost rows=13983) (actual rows=0)
>
> For a general number, it thinks it might return 14,000 rows, hence
> the sequential scan. Before you do ANALYZE, it uses whatever defaults
> exist, which are probably closer to reality.
>
> The problem is that you probably have some values for pkk_purchase
> where it could return 14,000 rows (possibly much much more). And for
> those, seq scan is the best plan. However, for the particular value
> that you are testing, there are very few (no) entries in the table.

You are absolutely correct:

pkk=# select offer_id, count(*) from pkk_purchase group by offer_id order by count;
 offer_id | count
----------+--------
     1019 |      1
     1018 |      1
     1016 |      1     (many of these)
      ... |    ...
     2131 |      6
      844 |      6
     1098 |      6     (a dozen or so of these)
      ... |    ...
     2263 |    682
     2145 |    723
     2258 |    797
     2091 |    863
      ... |    ...
     1153 |  96330     (the few heavy weights)
      244 | 122163
      242 | 255719
      243 | 273427
      184 | 348476

> With a prepared statement (or a function) it has to determine ahead
> of time what the best query is without knowing what value you are
> going to ask for.
>
> Let's say for a second that you manage to trick it into using index
> scan, and then you actually call the function with one of the values
> that returns 1,000s of rows. Probably it will take 10-100 times
> longer than if it used a seq scan.

Hmm... The fact is I am selecting (in this example anyway) over all
values in the pkk_offer table and calling the stored function with
each pkk_offer.offer_id, which in turn does a select on the
pkk_purchase table. Note that offer_id is a foreign key in
pkk_purchase referencing the pkk_offer table.

I don't know if it matters (I suspect that it does), but I am using
LIMIT 1 in the sub-query/stored function. All I need is one single row
meeting any of the criteria laid out in the stored procedure to
establish that an offer_id is "pending".

> So what is the solution? The only one I'm aware of is to turn your
> static function into a dynamic one.
>
> So somewhere within the function you build up a SQL query string and
> call EXECUTE str. This forces the query planner to be run every time
> you call the function. This means that if you call it with a "nice"
> value, you will get the fast index scan, and if you call it with a
> "bad" value, it will switch back to seq scan.
>
> The downside is you don't get much of a benefit from using a stored
> procedure, as it has to run the query planner all the time (as though
> you issued the query manually each time). But it still might be
> better for you in the long run.
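If I follow you, my function would end up looking something like this
(just a sketch to make sure I understand the approach -- the name
pkk_offer_has_pending is made up, and I haven't run this exact
version):

    create or replace function pkk_offer_has_pending (integer)
    returns boolean as '
    declare
        v_offer_id  alias for $1;
        v_row       record;
    begin
        -- Building the query string with the literal offer_id and
        -- running it via EXECUTE forces a re-plan on every call, so
        -- the planner can pick an index scan for the rare values and
        -- a seq scan for the heavy weights.
        for v_row in execute
               ''select p0.purchase_id from pkk_purchase p0 ''
            || ''where p0.offer_id = '' || v_offer_id
            || '' and ( p0.pending = true ''
            || ''      or ( ( p0.expire_time > now() or p0.expire_time isnull ) ''
            || ''           and p0.cancel_date isnull ) ) ''
            || ''limit 1''
        loop
            return true;    -- found one qualifying row: offer is pending
        end loop;
        return false;       -- no rows matched
    end;
    ' language 'plpgsql';

I suppose the other downside, besides planning on every call, is that
the quoting would have to be handled by hand for anything that isn't
an integer.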
Well, running the query without the stored function, basically typing
out the stored function as a sub-query, shows me:

pkk=# explain analyze
      select o0.offer_id,
             ( select case when ( select p0.purchase_id
                                    from pkk_purchase p0
                                   where p0.offer_id = o0.offer_id
                                     and ( p0.pending = true
                                           or ( ( p0.expire_time > now()
                                                  or p0.expire_time isnull )
                                                and p0.cancel_date isnull ) )
                                   limit 1 ) isnull
                           then false
                           else true
                      end )
        from pkk_offer o0 ;

                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual time=2171.287..2171.289 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual time=2171.264..2171.266 rows=1 loops=618)
                   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
                         Filter: ((offer_id = $0) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

whereas after deleting all statistics on the pkk_% tables I get:

pkk=# delete from pg_statistic
      where pg_statistic.starelid = pg_class.oid
        and pg_class.relname like 'pkk_%';
DELETE 11
pkk=# explain analyze
      select o0.offer_id,
             ( select case when ( select p0.purchase_id
                                    from pkk_purchase p0
                                   where p0.offer_id = o0.offer_id
                                     and ( p0.pending = true
                                           or ( ( p0.expire_time > now()
                                                  or p0.expire_time isnull )
                                                and p0.cancel_date isnull ) )
                                   limit 1 ) isnull
                           then false
                           else true
                      end )
        from pkk_offer o0 ;

                                       QUERY PLAN
------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual time=0.190..799.930 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual time=1.266..1.267 rows=1 loops=618)
                   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1 loops=618)
                         Index Cond: (offer_id = $0)
                         Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see, this query (over all values of pkk_offer) without any
pg_statistics on the pkk_purchase table is extremely fast. Is this a
bug in the PostgreSQL planner that misjudges the best choice with
pg_statistics at hand?
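In case it helps diagnose this, I assume the skew should be visible in
the pg_stats view, and that raising the statistics target on offer_id
(so that more of the heavy weights land in most_common_vals) is the
standard thing to try next -- I haven't verified either of these yet:

    select n_distinct, most_common_vals, most_common_freqs
      from pg_stats
     where tablename = 'pkk_purchase'
       and attname = 'offer_id';

    -- if the heavy weights are missing from most_common_vals:
    alter table pkk_purchase alter column offer_id set statistics 1000;
    analyze pkk_purchase;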
--patrick

__________________________________
Do you Yahoo!?
Check out the new Yahoo! Front Page.
www.yahoo.com