Re: vacuum analyze slows sql query - Mailing list pgsql-performance
From | John Meinel |
---|---|
Subject | Re: vacuum analyze slows sql query |
Date | |
Msg-id | 419008DA.3000007@johnmeinel.com Whole thread Raw |
In response to | Re: vacuum analyze slows sql query (patrick ~ <sidsrr@yahoo.com>) |
Responses |
Re: vacuum analyze slows sql query
|
List | pgsql-performance |
patrick ~ wrote: > Hi John, > > Thanks for your reply and analysis. > No problem. It just happens that this is a problem we ran into recently. > > --- John Meinel <john@johnmeinel.com> wrote: > > >>patrick ~ wrote: [...] > > Hmm... The fact is I am selecting (in this example anyway) over all > values in pkk_offer table and calling the stored function with each > pkk_offer.offer_id which in turn does a select on pkk_purchase table. > Note that offer_id is a foreign key in pkk_purchase referencing > pkk_offer table. > > I don't know if it matters (I suspect that it does) but I am using > LIMIT 1 in the sub-query/stored function. All I need is one single > row meeting any of the criteria laid out in the stored procedure to > establish an offer_id is "pending". > If you are trying to establish existence, we also had a whole thread on this. Basically what we found was that adding an ORDER BY clause, helped tremendously in getting the planner to switch to an Index scan. You might try something like: SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1; There seems to be a big difference between the above statement and: SELECT column FROM mytable WHERE column='myval' LIMIT 1; > > >>So what is the solution? The only one I'm aware of is to turn your >>static function into a dynamic one. >> >>So somewhere within the function you build up a SQL query string and >>call EXECUTE str. This forces the query planner to be run every time you >>call the function. This means that if you call it will a "nice" value, >>you will get the fast index scan, and if you call it with a "bad" value, >>it will switch back to seq scan. >> >>The downside is you don't get much of a benefit from using as stored >>procedure, as it has to run the query planner all the time (as though >>you issue the query manually each time.) But it still might be better >>for you in the long run. > > > > Well, running the query without the stored function, basically typing > out the stored function as a sub-query shows me: > > > pkk=# explain analyze select o0.offer_id, ( select case when ( select > p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and ( > p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and > p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from > pkk_offer o0 ; > QUERY PLAN > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Seq Scan on pkk_offer o0 (cost=0.00..1834.11 rows=618 width=4) (actual > time=2413.398..1341885.084 rows=618 loops=1) > SubPlan > -> Result (cost=2.94..2.95 rows=1 width=0) (actual > time=2171.287..2171.289 rows=1 loops=618) > InitPlan > -> Limit (cost=0.00..2.94 rows=1 width=4) (actual > time=2171.264..2171.266 rows=1 loops=618) > -> Seq Scan on pkk_purchase p0 (cost=0.00..37225.83 > rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618) > Filter: ((offer_id = $0) AND > (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR > (pending = true)) AND ((cancel_date IS NULL) OR (pending = true))) > Total runtime: 1341887.523 ms > (8 rows) > > > while deleting all statistics on the pkk_% tables I get: > > pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and > pg_class.relname like 'pkk_%'; > DELETE 11 > > pkk=# explain analyze select o0.offer_id, ( select case when ( select > p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and ( > p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and > p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from > pkk_offer o0 ; > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > Seq Scan on pkk_offer o0 (cost=0.00..6646.94 rows=618 width=4) (actual > time=0.190..799.930 rows=618 loops=1) > SubPlan > -> Result (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278 > rows=1 loops=618) > InitPlan > -> Limit (cost=0.00..10.73 rows=1 width=4) (actual > time=1.266..1.267 rows=1 loops=618) > -> Index Scan using pur_offer_id_idx on pkk_purchase p0 > (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1 > loops=618) > Index Cond: (offer_id = $0) > Filter: ((((expire_time)::timestamp with time zone > > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) > OR (pending = true))) > Total runtime: 801.234 ms > (9 rows) > > > As you can see this query (over all values of pkk_offer) with out > any pg_statistics on the pkk_purchase table is extremely fast. > > Is this a bug in the PostgreSQL planner that misjudges the best > choice with pg_statistics at hand? > > --patrick > In order to understand your query I broke it up and restructured it as follows. You might try to add the ORDER BY line, and see what you get. EXPLAIN ANALYZE SELECT o0.offer_id, ( SELECT CASE WHEN ( SELECT p0.purchase_id FROM pkk_purchase p0 WHERE p0.offer_id = o0.offer_id AND ( p0.pending = true OR ( p0.cancel_date ISNULL AND ( p0.expire_time > NOW() or p0.expire_time ISNULL ) ) ) ORDER BY p0.purchase_id --Insert this line LIMIT 1 ) ISNULL THEN false ELSE true END ) FROM pkk_offer o0 ; I also wonder about some parts of your query. I don't know your business logic but you are tacking a lot of the query into the WHERE, and I wonder if postgres just thinks it's going to need to analyze all the data before it gets a match. I also don't remember what columns you have indices on. Or whether it is common to have cancel_date null, or expire_time > NOW() or expire_time null, etc. So is your function just everything within the CASE statement? You might try rewriting it as a loop using a cursor, as I believe using a cursor again lends itself to index scans (as it is even more likely that you will not get all the data.) Something like (this is untested) create function is_pending(int) returns bool as ' declare p_id alias for $1; begin DECLARE is_pending_cursor CURSOR FOR SELECT p0.purchase_id FROM pkk_purchase p0 WHERE p0.offer_id = p_id; FOR READ ONLY; FOR FETCH NEXT is_pending_cursor IF row.pending = true or ... RETURN true; RETURN false; END; '; I don't know cursors terribly well, but this might get you going. Probably in your case you also have a large portion of the records with pending = true, which means that with an index scan it doesn't have to hit very many records. Either you have a low record count for a particular purchase_id, or you have a lot of pendings. seq scan just hurts because it has to sift out all the other id's that you don't care about. But remember, I'm not a guru, just someone who has been hit by the inequal distribution problem. John =:->
Attachment
pgsql-performance by date: