Thread: VACUUM ANALYZE slows SQL query
Greetings pgsql-performance :)

Yesterday I posted to the pgsql-sql list about an issue with VACUUM while trying to track down a performance problem with a SQL SELECT statement involving a stored function. It was suggested that I bring the discussion over to -performance.

Instead of reposting the message, here is a link to the original message, followed by a brief summary:

http://marc.theaimsgroup.com/?l=postgresql-sql&m=109945118928530&w=2

Summary:

Our customer complains about sluggishness of the web/php-based UI accessing the data in the db. I created a "stripped down" version of the tables in question to be able to post to the pgsql-sql list asking for hints as to how I can improve the SQL query. While doing this I noticed that if I 'createdb' and populate it with the "sanitized" data, the query in question is quite fast: 618 rows returned in 864.522 ms. This was puzzling. Next I noticed that after a VACUUM the very same query would slow down to a crawl: 618 rows returned in 1080688.921 ms.

This was reproduced on PostgreSQL 7.4.2 running on an Intel PIII 700MHz with 512MB RAM. This system is my /personal/ test system/sandbox, i.e. it isn't being stressed by any other processes.

Thanks for reading,
--patrick

__________________________________
Do you Yahoo!?
Check out the new Yahoo! Front Page.
www.yahoo.com
Given that the plan doesn't change after an analyze, my guess would be that the first query is hitting cached data; then you vacuum, and that chews through all the cache with its own data, pushing the good data out of the cache so it has to be re-fetched from disk.

If you run the select a second time after the vacuum, what is the time?

Not sure what your pkk_offer_has_pending_purch function does; that might be something to look at as well.

I could be wrong, but that's the only thing that makes sense to me. ARC is supposed to help with that type of behavior in 8.0.

patrick ~ wrote:
> Greetings pgsql-performance :)
>
> Yesterday I posted to the pgsql-sql list about an issue with VACUUM
> while trying to track down an issue with performance of a SQL SELECT
> statement involving a stored function. It was suggested that I bring
> the discussion over to -performance.
> [...]
> While doing this I noticed
> that if I 'createdb' and populate it with the "sanitized" data the
> query in question is quite fast; 618 rows returned in 864.522 ms.
> This was puzzling. Next I noticed that after a VACUUM the very same
> query would slow down to a crawl; 618 rows returned in 1080688.921 ms.
>
> This was reproduced on PostgreSQL 7.4.2 running on an Intel PIII 700MHz,
> 512MB. This system is my /personal/ test system/sandbox, i.e. it
> isn't being stressed by any other processes.
>
> Thanks for reading,
> --patrick
patrick ~ <sidsrr@yahoo.com> writes:
> that if I 'createdb' and populate it with the "sanitized" data the
> query in question is quite fast; 618 rows returned in 864.522 ms.
> This was puzzling. Next I noticed that after a VACUUM the very same
> query would slow down to a crawl; 618 rows returned in 1080688.921 ms.

The outer query is too simple to have more than one possible plan, so the issue is certainly a change in query plans inside the function. You need to be investigating what's happening inside that function. 7.1 doesn't have adequate tools for this, but in 7.4 you can use PREPARE and EXPLAIN ANALYZE EXECUTE to examine the query plans used for parameterized statements, which is what you've got here.

My bet is that with ANALYZE stats present, the planner guesses wrong about which index to use; but without looking at EXPLAIN ANALYZE output there's no way to be sure.

BTW, why the bizarrely complicated substitute for a NOT NULL test? ISTM you only need

	create function pkk_offer_has_pending_purch( integer )
	returns bool
	as '
	    select p0.purchase_id is not null
	      from pkk_purchase p0
	     where p0.offer_id = $1
	       and ( p0.pending = true
	             or ( ( p0.expire_time > now()
	                    or p0.expire_time isnull )
	                  and p0.cancel_date isnull ) )
	     limit 1
	' language 'sql' ;

(Actually, seeing that pkk_purchase.purchase_id is defined as NOT NULL, I wonder why the function exists at all ... but I suppose you've "stripped" the function to the point of being nonsense.)

			regards, tom lane
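Tom's suggestion, preparing the function's inner query as a parameterized statement and then explaining an EXECUTE of it, would look roughly like this (a sketch; the statement name pkk_00 and the parameter value 241 are just examples, and the body is the inner query of the function above):

```sql
-- Prepare the parameterized query that the SQL function runs internally,
-- so the planner builds the same kind of generic plan the function sees.
PREPARE pkk_00 ( integer ) AS
    SELECT p0.purchase_id IS NOT NULL
      FROM pkk_purchase p0
     WHERE p0.offer_id = $1
       AND ( p0.pending = true
             OR ( ( p0.expire_time > now()
                    OR p0.expire_time ISNULL )
                  AND p0.cancel_date ISNULL ) )
     LIMIT 1 ;

-- Show the plan actually chosen for the parameterized statement.
EXPLAIN ANALYZE EXECUTE pkk_00( 241 ) ;

DEALLOCATE pkk_00 ;
```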
Here is a fresh run with 'explain analyze' run before and after the VACUUM statement:

-- begin
% dropdb pkk
DROP DATABASE
% createdb pkk
CREATE DATABASE
% psql pkk < pkk_db.sql
ERROR:  function pkk_offer_has_pending_purch(integer) does not exist
ERROR:  function pkk_offer_has_pending_purch2(integer) does not exist
ERROR:  table "pkk_billing" does not exist
ERROR:  table "pkk_purchase" does not exist
ERROR:  table "pkk_offer" does not exist
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "pkk_offer_pkey" for table "pkk_offer"
CREATE TABLE
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "pkk_purchase_pkey" for table "pkk_purchase"
CREATE TABLE
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE TABLE
CREATE INDEX
CREATE FUNCTION
CREATE FUNCTION
% zcat pkk.20041028_00.sql.gz | psql pkk
SET
SET
SET
SET
% psql pkk

pkk=# select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
<omitting output />
(618 rows)

Time: 877.348 ms

pkk=# explain analyze select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer  (cost=0.00..22.50 rows=1000 width=4) (actual time=1.291..845.485 rows=618 loops=1)
 Total runtime: 849.475 ms
(2 rows)

Time: 866.613 ms

pkk=# vacuum analyze ;
VACUUM
Time: 99344.399 ms

pkk=# explain analyze select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer  (cost=0.00..13.72 rows=618 width=4) (actual time=3636.401..1047412.851 rows=618 loops=1)
 Total runtime: 1047415.525 ms
(2 rows)

Time: 1047489.477 ms
-- end

Tom,

The reason for the extra "case" part in the function is to ensure non-null fields in the result.
I tried your version as well and I get similar performance results:

-- begin
pkk=# create function toms_pending_purch( integer ) returns bool as '
    select p0.purchase_id is not null
      from pkk_purchase p0
     where p0.offer_id = $1
       and ( p0.pending = true
             or ( ( p0.expire_time > now()
                    or p0.expire_time isnull )
                  and p0.cancel_date isnull ) )
     limit 1
' language 'sql' ;
CREATE FUNCTION
Time: 2.496 ms

pkk=# select offer_id, toms_pending_purch( offer_id ) from pkk_offer ;
(618 rows)

Time: 1052339.506 ms
-- end

Right now, I'm studying the documentation section on PREPARE and will attempt to play around with it.

I was asked (in a prior post) whether running the statement a second time after the VACUUM improves performance. It does not. After the VACUUM the statement remains slow.

Thanks for your help,
--patrick


--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The outer query is too simple to have more than one possible plan,
> so the issue is certainly a change in query plans inside the function.
> You need to be investigating what's happening inside that function.
> 7.1 doesn't have adequate tools for this, but in 7.4 you can use
> PREPARE and EXPLAIN ANALYZE EXECUTE to examine the query plans used
> for parameterized statements, which is what you've got here.
>
> My bet is that with ANALYZE stats present, the planner guesses wrong
> about which index to use; but without looking at EXPLAIN ANALYZE output
> there's no way to be sure.
> [...]
Just wanted to know if there were any insights after looking at the requested 'explain analyze select ...' output?

Thanks,
--patrick
Looking around at the pg_ tables and some PostgreSQL online docs, prompted by another post/reply on this list regarding ALTER TABLE SET STATISTICS, I found out that prior to a VACUUM the following select (taken from the online docs) shows:

pkk=# select relname, relkind, reltuples, relpages from pg_class where relname like 'pkk_%';
      relname      | relkind | reltuples | relpages
-------------------+---------+-----------+----------
 pkk_billing       | r       |      1000 |       10
 pkk_offer         | r       |      1000 |       10
 pkk_offer_pkey    | i       |      1000 |        1
 pkk_purchase      | r       |      1000 |       10
 pkk_purchase_pkey | i       |      1000 |        1
(5 rows)

Time: 1097.263 ms

and after a VACUUM:

pkk=# vacuum analyze ;
VACUUM
Time: 100543.359 ms

it shows:

pkk=# select relname, relkind, reltuples, relpages from pg_class where relname like 'pkk_%';
      relname      | relkind |  reltuples  | relpages
-------------------+---------+-------------+----------
 pkk_billing       | r       |      714830 |     4930
 pkk_offer         | r       |         618 |        6
 pkk_offer_pkey    | i       |         618 |        4
 pkk_purchase      | r       | 1.14863e+06 |     8510
 pkk_purchase_pkey | i       | 1.14863e+06 |     8214
(5 rows)

Time: 3.868 ms

Further, I notice that if I delete rows from the pg_statistic table I get the db into a state where the query is fast again:

pkk=# explain analyze select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer  (cost=0.00..13.72 rows=618 width=4) (actual time=2415.739..1065709.092 rows=618 loops=1)
 Total runtime: 1065711.651 ms
(2 rows)

Time: 1065713.446 ms

pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and pg_class.relname like 'pkk_%';
DELETE 11
Time: 3.368 ms

pkk=# select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
(618 rows)

Time: 876.377 ms

pkk=# explain analyze select offer_id, pkk_offer_has_pending_purch( offer_id ) from pkk_offer ;
                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer  (cost=0.00..13.72 rows=618 width=4) (actual time=1.329..846.786 rows=618 loops=1)
 Total runtime: 848.170 ms
(2 rows)

Time: 849.958 ms

Now, I'm sure someone (a PostgreSQL developer most likely) is about to shoot me for doing such a thing :-) But, however *ugly, wrong, sacrilegious* this may be, if this is the only solution... err, workaround I have that will help me, I must resort to it.

The only two questions I have about this are:

1. Is this really the only solution left for me?
2. Am I in any way screwing up the db by doing this?

Best regards,
--patrick
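One caveat about the pg_statistic workaround above: the deleted rows come back the next time the tables are analyzed, so the trick has to be repeated after every ANALYZE. A sketch of verifying that (table name from the thread; the rest is standard catalog querying):

```sql
-- Re-analyzing a single table regenerates its pg_statistic rows,
-- undoing the manual DELETE above.
ANALYZE pkk_purchase ;

-- Per-column statistics can be inspected through the pg_stats view
-- instead of the raw pg_statistic catalog.
SELECT tablename, attname, n_distinct, correlation
  FROM pg_stats
 WHERE tablename = 'pkk_purchase' ;
```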
patrick ~ <sidsrr@yahoo.com> writes:
> 1. Is this really the only solution left for me?

You still haven't followed the suggestions that were given to you (ie, find out what is happening with the plan for the query inside the problematic function).

			regards, tom lane
Hi Tom, -performance@,

I apologize if I didn't follow through with the PREPARE and EXECUTE suggestion. I assume that is what you are referring to. After reading the PostgreSQL docs on the PREPARE statement I realized two things: a) a PREPARE only lasts for the session, and b) I can not (at least I haven't figured out how) PREPARE a statement which would mimic my original select statement and which I could EXECUTE over all rows of the pkk_offer table.

The best I could do is either:

	PREPARE pkk_01 ( integer ) as
	  select $1, pkk_offer_has_pending_purch( $1 ) from pkk_offer ;

or

	PREPARE pkk_00 ( integer ) as
	  <the def of pkk_offer_has_pending_purch( integer )>

In the former case the EXPLAIN ANALYZE doesn't give enough data (it is the same as without the PREPARE statement). In the latter case, I can only execute it with one offer_id at a time.

Is this sufficient? If so, here are the results before and after VACUUM ANALYZE:

pkk=# explain analyze execute pkk_00( 795 ) ;
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8.57..8.58 rows=1 width=0) (actual time=0.095..0.096 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..8.57 rows=1 width=4) (actual time=0.083..0.084 rows=1 loops=1)
           ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..17.13 rows=2 width=4) (actual time=0.079..0.079 rows=1 loops=1)
                 Index Cond: (offer_id = $1)
                 Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 0.238 ms
(7 rows)

pkk=# VACUUM ANALYZE ;
VACUUM
Time: 97105.589 ms

pkk=# explain analyze execute pkk_00( 795 ) ;
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8.57..8.58 rows=1 width=0) (actual time=0.329..0.330 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..8.57 rows=1 width=4) (actual time=0.311..0.312 rows=1 loops=1)
           ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..17.13 rows=2 width=4) (actual time=0.307..0.307 rows=1 loops=1)
                 Index Cond: (offer_id = $1)
                 Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 0.969 ms
(7 rows)

Time: 16.252 ms

In both the before and after cases an "Index Scan" is used on pur_offer_id_idx. So, unless I'm missing something obvious here, I am at a loss.

I went as far as doing EXPLAIN ANALYZE EXECUTE pkk_00( offer_id ) for each offer_id in the pkk_offer table, one at a time (not manually, but by scripting it). All instances use "Index Scan". I only noticed a couple that had quite large "actual times", like the following:

pkk=# explain analyze execute pkk_00( 2312 ) ;
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8.57..8.58 rows=1 width=0) (actual time=21.279..21.282 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..8.57 rows=1 width=4) (actual time=21.256..21.258 rows=1 loops=1)
           ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..17.13 rows=2 width=4) (actual time=21.249..21.249 rows=1 loops=1)
                 Index Cond: (offer_id = $1)
                 Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 21.435 ms
(7 rows)

Time: 22.541 ms

Which makes sense when you look at the number of entries this offer_id has in the pkk_purchase table vs offer_id = 795:

pkk=# select offer_id, count(*) from pkk_purchase where offer_id in ( 795, 2312 ) group by offer_id ;
 offer_id | count
----------+-------
      795 |     4
     2312 |  1015
(2 rows)

Time: 21.118 ms

--patrick


--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

> You still haven't followed the suggestions that were given to you
> (ie, find out what is happening with the plan for the query inside
> the problematic function).
patrick ~ <sidsrr@yahoo.com> writes:
> PREPARE pkk_00 ( integer ) <the def of pkk_offer_has_pending_purch( integer )>

This is what you want to do, but not quite like that. The PREPARE determines the plan, and so VACUUMing and re-EXECUTEing is going to show the same plan. What we need to look at is

	- standing start
	PREPARE pkk_00 ...
	EXPLAIN ANALYZE EXECUTE pkk_00 ...
	VACUUM ANALYZE;
	PREPARE pkk_01 ...
	EXPLAIN ANALYZE EXECUTE pkk_01 ...

			regards, tom lane
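The sequence Tom describes could be sketched concretely like this (a simplified stand-in for the inner query is used for brevity, and 241 is an arbitrary example value; the point is that pkk_01 is a *new* prepared statement, planned against the post-VACUUM statistics):

```sql
-- From a standing start (freshly loaded database, no stats yet):
PREPARE pkk_00 ( integer ) AS
    SELECT purchase_id FROM pkk_purchase WHERE offer_id = $1 LIMIT 1 ;
EXPLAIN ANALYZE EXECUTE pkk_00( 241 ) ;   -- plan chosen without stats

VACUUM ANALYZE ;

-- A new prepared statement, so its plan is built against the new stats:
PREPARE pkk_01 ( integer ) AS
    SELECT purchase_id FROM pkk_purchase WHERE offer_id = $1 LIMIT 1 ;
EXPLAIN ANALYZE EXECUTE pkk_01( 241 ) ;   -- plan chosen with stats
```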
Sorry for the late reply. Was feeling a bit under the weather this weekend and didn't get a chance to look at this.

--- Tom Lane <tgl@sss.pgh.pa.us> wrote:

> This is what you want to do, but not quite like that. The PREPARE
> determines the plan and so VACUUMing and re-EXECUTEing is going to show
> the same plan. What we need to look at is
> 	- standing start
> 	PREPARE pkk_00 ...
> 	EXPLAIN ANALYZE EXECUTE pkk_00 ...
> 	VACUUM ANALYZE;
> 	PREPARE pkk_01 ...
> 	EXPLAIN ANALYZE EXECUTE pkk_01 ...

But of course! I feel a bit silly now. This is what I get after following Tom's directions:

pkk=# prepare pkk_00 ( integer ) as select ...
PREPARE
Time: 1.753 ms

pkk=# execute pkk_00( 241 );
 case
------
 f
(1 row)

Time: 0.788 ms

pkk=# explain analyze execute pkk_00( 241 );
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=10.73..10.74 rows=1 width=0) (actual time=0.067..0.068 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual time=0.055..0.055 rows=0 loops=1)
           ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..20690.18 rows=1929 width=4) (actual time=0.052..0.052 rows=0 loops=1)
                 Index Cond: (offer_id = $1)
                 Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 0.213 ms
(7 rows)

Time: 24.654 ms

pkk=# vacuum analyze ;
VACUUM
Time: 128826.078 ms

pkk=# prepare pkk_01 ( integer ) as select ...
PREPARE
Time: 104.658 ms

pkk=# execute pkk_01( 241 );
 case
------
 f
(1 row)

Time: 7652.708 ms

pkk=# explain analyze execute pkk_01( 241 );
                                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual time=2872.189..2872.189 rows=0 loops=1)
           ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983 width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
                 Filter: ((offer_id = $1) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 2872.339 ms
(6 rows)

Time: 2873.479 ms

So it looks like after the VACUUM the planner resorts to a Seq Scan rather than an Index Scan. This is because the value of the correlation field in pg_stats (according to the PostgreSQL docs) is closer to 0 than to ±1:

pkk=# select tablename, attname, correlation from pg_stats where tablename = 'pkk_purchase' and attname = 'offer_id' ;
  tablename   | attname  | correlation
--------------+----------+-------------
 pkk_purchase | offer_id |    0.428598
(1 row)

So I started to experiment with ALTER TABLE SET STATISTICS values to see which gets the correlation closer to ±1. The trend seems to indicate that the higher the stat value is set, the closer the correlation value is pushed toward 0:

 set statistics | correlation
----------------+-------------
            800 |    0.393108
            500 |    0.408137
            200 |     0.43197
             50 |    0.435211
              1 |     0.45758

And a subsequent PREPARE and EXPLAIN ANALYZE confirms that the planner reverts back to using the Index Scan after setting stats to 1 (even though the correlation value is still closer to 0 than to 1):

pkk=# explain analyze execute pkk_02( 241 );
                                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=2.95..2.96 rows=1 width=0) (actual time=0.068..0.069 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..2.95 rows=1 width=4) (actual time=0.056..0.056 rows=0 loops=1)
           ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..35810.51 rows=12119 width=4) (actual time=0.053..0.053 rows=0 loops=1)
                 Index Cond: (offer_id = $1)
                 Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 0.200 ms
(7 rows)

So, is this the ultimate solution to this issue?

--patrick
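For reference, the per-column statistics-target experiment described above would be run along these lines (a sketch; the target value 50 is one of the values tried, and the stats only change once the table is re-analyzed):

```sql
-- Change the statistics target for one column, then re-gather stats;
-- correlation is only recomputed by the next ANALYZE.
ALTER TABLE pkk_purchase ALTER COLUMN offer_id SET STATISTICS 50 ;
ANALYZE pkk_purchase ;

SELECT correlation
  FROM pg_stats
 WHERE tablename = 'pkk_purchase' AND attname = 'offer_id' ;
```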
patrick ~ wrote:
[...]
> pkk=# explain analyze execute pkk_01( 241 );
>  Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213 rows=1 loops=1)
>    InitPlan
>      ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual time=2872.189..2872.189 rows=0 loops=1)
>            ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983 width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
>                  Filter: ((offer_id = $1) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
>  Total runtime: 2872.339 ms
> (6 rows)
>
> Time: 2873.479 ms
[...]
> So, is this the ultimate solution to this issue?
>
> --patrick

It's not so much that the correlation is < 0.5. It sounds like you're running into the same issue that I ran into in the past. You have a column with lots of repeated values, and a few exceptional ones. Notice this part of the query:

	-> Seq Scan on pkk_purchase p0 (cost rows=13983) (actual rows=0)

For a generic parameter, it thinks it might return 14,000 rows, hence the sequential scan. Before you do ANALYZE, it uses whatever defaults exist, which are probably closer to reality.

The problem is that you probably have some values in pkk_purchase for which it could return 14,000 rows (possibly much much more). And for those, a seq scan is the best plan. However, for the particular value that you are testing, there are very few (no) entries in the table.

With a prepared statement (or a function) it has to determine ahead of time what the best plan is, without knowing what value you are going to ask for.

Let's say for a second that you manage to trick it into using an index scan, and then you actually call the function with one of the values that returns 1,000s of rows. It will probably take 10-100 times longer than if it used a seq scan.

So what is the solution? The only one I'm aware of is to turn your static function into a dynamic one.

So somewhere within the function you build up a SQL query string and call EXECUTE str. This forces the query planner to be run every time you call the function. This means that if you call it with a "nice" value, you will get the fast index scan, and if you call it with a "bad" value, it will switch back to a seq scan.

The downside is you don't get much of a benefit from using a stored procedure, as it has to run the query planner all the time (as though you issued the query manually each time). But it still might be better for you in the long run.

Example: instead of

	create function test(int) returns int as '
	declare
	    x alias for $1;
	    y int;
	begin
	    select into y ... from ... where id = x limit ...;
	    return y;
	end;
	' language 'plpgsql';

use this format

	create function test(int) returns int as '
	declare
	    x alias for $1;
	    y int;
	    r record;
	begin
	    -- EXECUTE plans the string at call time, using the actual value of x
	    for r in execute ''select ... as y from ... where id = ''
	            || quote_literal(x) || '' limit ...'' loop
	        y := r.y;
	    end loop;
	    return y;
	end;
	' language 'plpgsql';

I think that will point you in the right direction.

John
=:->
Hi John,

Thanks for your reply and analysis.

--- John Meinel <john@johnmeinel.com> wrote:

> It's not so much that the correlation is < 0.5. It sounds like you're
> running into the same issue that I ran into in the past. You have a
> column with lots of repeated values, and a few exceptional ones.
> [...]
> The problem is that you probably have some values in pkk_purchase for
> which it could return 14,000 rows (possibly much much more). And for
> those, a seq scan is the best plan. However, for the particular value
> that you are testing, there are very few (no) entries in the table.

You are absolutely correct:

pkk=# select offer_id, count(*) from pkk_purchase group by offer_id order by count ;
 offer_id | count
----------+--------
     1019 |      1
     1018 |      1
     1016 |      1    (many of these)
      ... |    ...
     2131 |      6
      844 |      6
     1098 |      6    (a dozen or so of these)
      ... |    ...
     2263 |    682
     2145 |    723
     2258 |    797
     2091 |    863
      ... |    ...
     1153 |  96330    (the few heavyweights)
      244 | 122163
      242 | 255719
      243 | 273427
      184 | 348476

> With a prepared statement (or a function) it has to determine ahead of
> time what the best plan is, without knowing what value you are going
> to ask for.
>
> Let's say for a second that you manage to trick it into using an index
> scan, and then you actually call the function with one of the values
> that returns 1,000s of rows. It will probably take 10-100 times longer
> than if it used a seq scan.

Hmm... The fact is I am selecting (in this example anyway) over all values in the pkk_offer table and calling the stored function with each pkk_offer.offer_id, which in turn does a select on the pkk_purchase table. Note that offer_id is a foreign key in pkk_purchase referencing the pkk_offer table.

I don't know if it matters (I suspect that it does), but I am using LIMIT 1 in the sub-query/stored function. All I need is one single row meeting any of the criteria laid out in the stored procedure to establish that an offer_id is "pending".

> So what is the solution? The only one I'm aware of is to turn your
> static function into a dynamic one.
> [...]
> The downside is you don't get much of a benefit from using a stored
> procedure, as it has to run the query planner all the time (as though
> you issued the query manually each time). But it still might be better
> for you in the long run.

Well, running the query without the stored function, basically typing out the stored function as a sub-query, shows me:

pkk=# explain analyze select o0.offer_id, ( select case when ( select p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and ( p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from pkk_offer o0 ;
                                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual time=2171.287..2171.289 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual time=2171.264..2171.266 rows=1 loops=618)
                   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
                         Filter: ((offer_id = $0) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

while after deleting all statistics on the pkk_% tables I get:

pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and pg_class.relname like 'pkk_%';
DELETE 11

pkk=# explain analyze select o0.offer_id, ( select case when ( select p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and ( p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from pkk_offer o0 ;
                                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual time=0.190..799.930 rows=618 loops=1)
   SubPlan
     ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278 rows=1 loops=618)
           InitPlan
             ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual time=1.266..1.267 rows=1 loops=618)
                   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1 loops=618)
                         Index Cond: (offer_id = $0)
                         Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see, this query (over all values of pkk_offer) without any pg_statistics on the pkk_purchase table is extremely fast.

Is this a bug in the PostgreSQL planner that misjudges the best choice with pg_statistics at hand?

--patrick
patrick ~ wrote: > Hi John, > > Thanks for your reply and analysis. > No problem. It just happens that this is a problem we ran into recently. > > --- John Meinel <john@johnmeinel.com> wrote: > > >>patrick ~ wrote: [...] > > Hmm... The fact is I am selecting (in this example anyway) over all > values in pkk_offer table and calling the stored function with each > pkk_offer.offer_id which in turn does a select on pkk_purchase table. > Note that offer_id is a foreign key in pkk_purchase referencing > pkk_offer table. > > I don't know if it matters (I suspect that it does) but I am using > LIMIT 1 in the sub-query/stored function. All I need is one single > row meeting any of the criteria laid out in the stored procedure to > establish an offer_id is "pending". > If you are trying to establish existence, we also had a whole thread on this. Basically what we found was that adding an ORDER BY clause, helped tremendously in getting the planner to switch to an Index scan. You might try something like: SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1; There seems to be a big difference between the above statement and: SELECT column FROM mytable WHERE column='myval' LIMIT 1; > > >>So what is the solution? The only one I'm aware of is to turn your >>static function into a dynamic one. >> >>So somewhere within the function you build up a SQL query string and >>call EXECUTE str. This forces the query planner to be run every time you >>call the function. This means that if you call it will a "nice" value, >>you will get the fast index scan, and if you call it with a "bad" value, >>it will switch back to seq scan. >> >>The downside is you don't get much of a benefit from using as stored >>procedure, as it has to run the query planner all the time (as though >>you issue the query manually each time.) But it still might be better >>for you in the long run. 
> Well, running the query without the stored function, basically typing
> out the stored function as a sub-query, shows me:
>
> pkk=# explain analyze select o0.offer_id, ( select case when ( select
> p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and (
> p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
> p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
> pkk_offer o0 ;
>                                                        QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual time=2413.398..1341885.084 rows=618 loops=1)
>    SubPlan
>      ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual time=2171.287..2171.289 rows=1 loops=618)
>            InitPlan
>              ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual time=2171.264..2171.266 rows=1 loops=618)
>                    ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
>                          Filter: ((offer_id = $0) AND (((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
>  Total runtime: 1341887.523 ms
> (8 rows)
>
> while after deleting all statistics on the pkk_% tables I get:
>
> pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and
> pg_class.relname like 'pkk_%';
> DELETE 11
>
> pkk=# explain analyze select o0.offer_id, ( select case when ( select
> p0.purchase_id from pkk_purchase p0 where p0.offer_id = o0.offer_id and (
> p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
> p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
> pkk_offer o0 ;
>                                                  QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual time=0.190..799.930 rows=618 loops=1)
>    SubPlan
>      ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278 rows=1 loops=618)
>            InitPlan
>              ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual time=1.266..1.267 rows=1 loops=618)
>                    ->  Index Scan using pur_offer_id_idx on pkk_purchase p0  (cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1 loops=618)
>                          Index Cond: (offer_id = $0)
>                          Filter: ((((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
>  Total runtime: 801.234 ms
> (9 rows)
>
> As you can see, this query (over all values of pkk_offer) without
> any pg_statistics on the pkk_purchase table is extremely fast.
>
> Is this a bug in the PostgreSQL planner that misjudges the best
> choice with pg_statistics at hand?
>
> --patrick

In order to understand your query, I broke it up and restructured it as
follows.  You might try adding the ORDER BY line and see what you get.

EXPLAIN ANALYZE
SELECT o0.offer_id,
    ( SELECT CASE WHEN
        ( SELECT p0.purchase_id FROM pkk_purchase p0
           WHERE p0.offer_id = o0.offer_id
             AND ( p0.pending = true
                   OR ( p0.cancel_date ISNULL
                        AND ( p0.expire_time > NOW() OR p0.expire_time ISNULL ) ) )
           ORDER BY p0.purchase_id  -- Insert this line
           LIMIT 1
        ) ISNULL THEN false ELSE true END
    )
  FROM pkk_offer o0 ;

I also wonder about some parts of your query.  I don't know your business
logic, but you are tacking a lot of the query into the WHERE clause, and
I wonder if postgres just thinks it's going to need to analyze all the
data before it gets a match.  I also don't remember what columns you have
indices on.
Or whether it is common to have cancel_date null, or expire_time > NOW(),
or expire_time null, etc.

So is your function just everything within the CASE statement?

You might try rewriting it as a loop using a cursor, as I believe a
cursor again lends itself to index scans (as it is even more likely
that you will not get all the data.)

Something like this (untested; a plpgsql FOR loop over a query uses a
cursor internally, and the IF condition mirrors the pending test from
your original query):

CREATE FUNCTION is_pending(integer) RETURNS boolean AS '
DECLARE
    p_id ALIAS FOR $1;
    r    RECORD;
BEGIN
    -- Fetch matching purchases one at a time via an implicit cursor.
    FOR r IN SELECT p0.pending, p0.expire_time, p0.cancel_date
               FROM pkk_purchase p0
              WHERE p0.offer_id = p_id
    LOOP
        IF r.pending = true
           OR ( r.cancel_date IS NULL
                AND ( r.expire_time > now() OR r.expire_time IS NULL ) )
        THEN
            RETURN true;
        END IF;
    END LOOP;
    RETURN false;
END;
' LANGUAGE plpgsql;

I don't know cursors terribly well, but this might get you going.

Probably in your case you also have a large portion of the records with
pending = true, which means that with an index scan it doesn't have to
hit very many records.  Either you have a low record count for a
particular purchase_id, or you have a lot of pendings.  A seq scan just
hurts because it has to sift out all the other ids that you don't care
about.

But remember, I'm not a guru, just someone who has been hit by the
unequal distribution problem.

John
=:->
>> Lets say for a second that you manage to trick it into using an index
>> scan, and then you actually call the function with one of the values
>> that returns 1,000s of rows. Probably it will take 10-100 times longer
>> than if it used a seq scan.

> I don't know if it matters (I suspect that it does) but I am using
> LIMIT 1 in the sub-query/stored function.  All I need is one single
> row meeting any of the criteria laid out in the stored procedure to
> establish that an offer_id is "pending".

So, in your case, if you LIMIT, the index scan will always be fast and
the seq scan will be catastrophic, because you don't need to retrieve
all the rows, just one.  (IMHO the planner screws up these LIMIT clauses
because it expects the data to be randomly distributed in the first page,
while in real life it's not.)

You could use EXISTS to test the existence of a subquery (after all,
that's its purpose), or you could spray in a little ORDER BY.  When

    SELECT ... FROM table WHERE stuff=value LIMIT 1

obstinately uses a seq scan, try

    SELECT ... FROM table WHERE stuff=value ORDER BY stuff LIMIT 1

and the ORDER BY will make the planner think "I could use the index to
order"...
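[Editor's note: applied to the query shown earlier in the thread, the
EXISTS suggestion would look something like the untested sketch below.
The WHERE clause is copied from patrick's original subquery; the
is_pending output column name is made up for illustration.]

```sql
SELECT o0.offer_id,
       EXISTS ( SELECT 1
                  FROM pkk_purchase p0
                 WHERE p0.offer_id = o0.offer_id
                   AND ( p0.pending = true
                         OR ( ( p0.expire_time > now() OR p0.expire_time IS NULL )
                              AND p0.cancel_date IS NULL ) )
       ) AS is_pending
  FROM pkk_offer o0;
```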
patrick ~ wrote:
> Hi John,
>
> Thanks for your reply and analysis.
>
> --- John Meinel <john@johnmeinel.com> wrote:
>
>> patrick ~ wrote:
>> [...]
>>
>>> pkk=# explain analyze execute pkk_01( 241 );
>>>                       QUERY PLAN
>>>

One other thing that I just thought of.  I think it is actually possible
to add an index on a function of a column.  So if you have the
"is_really_pending" function, you might be able to do:

    CREATE INDEX pkk_is_really_pending ON pkk_purchase
        ( is_really_pending(purchase_id) );

But you need a better guru than I to make sure of that.

This *might* do what you need.

John
=:->
--- John Meinel <john@johnmeinel.com> wrote:

> If you are trying to establish existence, we also had a whole thread on
> this.  Basically what we found was that adding an ORDER BY clause helped
> tremendously in getting the planner to switch to an index scan.  You
> might try something like:
>
> SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1;
>
> There seems to be a big difference between the above statement and:
>
> SELECT column FROM mytable WHERE column='myval' LIMIT 1;

The ORDER BY "trick" worked beautifully!  I just hope it'll continue to
work consistently in production code.

> I also wonder about some parts of your query.  I don't know your
> business logic but you are tacking a lot of the query into the WHERE,
> and I wonder if postgres just thinks it's going to need to analyze all
> the data before it gets a match.

I have a table of offers (pkk_offer), a table keeping track of all
purchases against each offer (pkk_purchase), and a third table keeping
track of billing for each purchase (pkk_billing).  That's the basic
setup of my db.  In actuality there are more tables and views involved
keeping track of usage, etc.

The one UI page that lists all offers in the system needs to indicate to
the user (operator) which offers are "pending".  The term "pending" is
used to mean that the particular offer either has an active purchase
against it or has a purchase which hasn't yet been entered into the
billing system (doesn't yet show up in pkk_billing).

An active purchase is indicated by pkk_purchase.expire_time being in the
future or IS NULL, where IS NULL indicates a subscription-type purchase
(a recurring purchase).  Offers are created to be either one-time
purchasable or subscription type.

The pkk_purchase.pending (boolean) column indicates whether or not the
purchase has been entered into the billing system.
It is a rare case where this flag remains true for a long period of time
(which would indicate something wrong with the billing sub-system).

There is a foreign key pkk_purchase.offer_id referencing
pkk_offer.offer_id.  And likewise, pkk_billing.client_id referencing
pkk_purchase.client_id and pkk_billing.purchase_id referencing
pkk_purchase.purchase_id.

> So is your function just everything within the CASE statement?

Yes.  I posted a "stripped down" version of the database on the
pgsql-sql@ list earlier this month if you are interested in looking
at it:

http://marc.theaimsgroup.com/?l=postgresql-sql&m=109945118928530&w=2

(Just a side note: MARC doesn't seem to subscribe to -performance or
-hackers.  I have requested them to carry these two lists.  I think if
more people request this it might happen.  I like their archiving
system.)

> You might try rewriting it as a loop using a cursor, as I believe using
> a cursor again lends itself to index scans (as it is even more likely
> that you will not get all the data.)

I may try this, as well as a suggestion by Pierre-Frédéric Caillaud to
use EXISTS, though my initial attempt to use it didn't seem to be any
faster than my original stored function.  So far the ORDER BY "trick"
seems to be the best solution.

I appreciate everyone's help and suggestions on this topic!

Best wishes,
--patrick
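[Editor's note: for reference, the schema described above amounts to
something like the sketch below.  This is reconstructed from the prose
description, not patrick's actual DDL; the column types and defaults are
assumptions.]

```sql
CREATE TABLE pkk_offer (
    offer_id    serial PRIMARY KEY
);

CREATE TABLE pkk_purchase (
    purchase_id serial PRIMARY KEY,
    offer_id    integer NOT NULL REFERENCES pkk_offer (offer_id),
    client_id   integer,
    expire_time timestamp,  -- NULL means a subscription (recurring) purchase
    cancel_date timestamp,
    pending     boolean NOT NULL DEFAULT true  -- not yet entered into billing
);

CREATE TABLE pkk_billing (
    -- per the description, client_id and purchase_id both refer back
    -- to pkk_purchase
    client_id   integer,
    purchase_id integer REFERENCES pkk_purchase (purchase_id)
);
```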
patrick ~ wrote:
> --- John Meinel <john@johnmeinel.com> wrote:
>
>> If you are trying to establish existence, we also had a whole thread
>> on this.  Basically what we found was that adding an ORDER BY clause
>> helped tremendously in getting the planner to switch to an index
>> scan.  You might try something like:
>>
>> SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1;
>>
>> There seems to be a big difference between the above statement and:
>>
>> SELECT column FROM mytable WHERE column='myval' LIMIT 1;
>
> The ORDER BY "trick" worked beautifully!  I just hope it'll
> continue to work consistently in production code.

For sure it will not break the goal: "check the existence".

Regards
Gaetano Mendola
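[Editor's note: putting the pieces of the thread together, the stored
function with the ORDER BY trick applied would look something like the
untested sketch below.  The function name pkk_offer_has_pending_purch
comes from the original post; the body mirrors the subquery shown
earlier in the thread, with the ORDER BY line added as John suggested.]

```sql
CREATE OR REPLACE FUNCTION pkk_offer_has_pending_purch(integer)
RETURNS boolean AS '
    SELECT CASE WHEN
        ( SELECT p0.purchase_id
            FROM pkk_purchase p0
           WHERE p0.offer_id = $1
             AND ( p0.pending = true
                   OR ( ( p0.expire_time > now() OR p0.expire_time IS NULL )
                        AND p0.cancel_date IS NULL ) )
           ORDER BY p0.purchase_id  -- nudges the planner toward an index scan
           LIMIT 1
        ) IS NULL THEN false ELSE true END;
' LANGUAGE sql;
```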