Thread: Re: [pgsql-hackers-win32] Poor Performance for large queries in functions
Tom Lane wrote:
> John Meinel <john@johnmeinel.com> writes:
>> ... However, if I try to bundle this query up into a server side
>> function, it runs very slow (10 seconds). I'm trying to figure out
>> why, but since I can't run EXPLAIN ANALYZE inside a function, I don't
>> really know what else to do.
>
> A parameterized query inside a function is basically the same as a
> PREPARE'd query with parameters at the SQL level. So you can
> investigate what's happening here with
>
>     PREPARE foo(int) AS
>         SELECT * FROM object WHERE id IN (
>             SELECT id FROM data_t WHERE project_id = $1
>             UNION SELECT ... ;
>
>     EXPLAIN ANALYZE EXECUTE foo(48542);
>
> I'm not sure where the problem is either, so please do send along the
> results.
>
> 			regards, tom lane
>
> PS: pgsql-performance would be a more appropriate venue for this
> discussion.

Well, I think I tracked the problem down to the fact that the column
does not have a "not null" constraint on it. Here is a demonstration.
Basically, I have 3 tables: tobjects, tdata, and tproject. tdata just
links between tobjects and tproject, but isn't required to link to
tproject. Yes, the real data has more columns, but this shows the
problem.

jfmeinel=> \d tobjects
     Table "public.tobjects"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
Indexes:
    "tobjects_pkey" primary key, btree (id)

jfmeinel=> \d tproject
     Table "public.tproject"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
Indexes:
    "tproject_pkey" primary key, btree (id)

jfmeinel=> \d tdata
        Table "public.tdata"
   Column   |  Type   | Modifiers
------------+---------+-----------
 id         | integer | not null
 project_id | integer |
Indexes:
    "tdata_pkey" primary key, btree (id)
    "tdata_project_id_idx" btree (project_id)
Foreign-key constraints:
    "tdata_id_fkey" FOREIGN KEY (id) REFERENCES tobjects(id)
        ON UPDATE CASCADE ON DELETE CASCADE
    "tdata_project_id_fkey" FOREIGN KEY (project_id) REFERENCES tproject(id)
        ON UPDATE CASCADE ON DELETE SET DEFAULT

jfmeinel=> select count(*) from tdata;
 count
--------
 545768

jfmeinel=> select count(*) - count(project_id) from tdata;
 ?column?
----------
      240

So tdata(project_id) is almost completely full: of the 545,768 entries,
only 240 are null.

jfmeinel=> prepare myget(int) as select id from tdata
jfmeinel-> where project_id = $1;
PREPARE

jfmeinel=> explain analyze execute myget(30000);
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
     (actual time=1047.000..1047.000 rows=0 loops=1)
   Filter: (project_id = $1)
 Total runtime: 1047.000 ms

jfmeinel=> explain analyze select id from tdata where project_id = 30000;
                             QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
     rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
   Index Cond: (project_id = 30000)
 Total runtime: 0.000 ms

So notice that when doing the actual select it is able to do the index
query. But for some reason with a prepared statement, it is not able to
do it.

Any ideas? Since I only have the integers now, I can send the data to
someone if they care to investigate it. It comes to 2.2M as a .tar.bz2,
so obviously I'm not going to spam the list.

If I rewrite myget as

    prepare myget(int) as select id from tdata where project_id = 30000;

it does the right thing again. So it's something about how a variable
interacts with an indexed column that has null values.
Note: I've tried creating a script that generates dummy data to show
this problem, and I have failed (it always performed the query
correctly). But this test data definitely shows the problem. And yes,
I've vacuum analyzed all over the place.

John
=:->

PS: I tested this on PostgreSQL 7.4.3, and it did not demonstrate this
problem. I am using PostgreSQL 8.0.0beta2 (probably -dev1).
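[For readers of the archive who want to try reproducing this, below is a
hypothetical sketch of the kind of generator script John describes
attempting. The table name is made up, the skew baked into project_id
reflects the distribution identified later in the thread, and John notes
his own generated data did not reproduce the problem. generate_series is
available in the 8.0 beta he is running.]

    -- Hypothetical reproduction attempt: ~545,768 rows, roughly 240
    -- NULLs, and one extremely common project_id value (18).
    CREATE TABLE tdata_dummy (
        id         integer PRIMARY KEY,
        project_id integer
    );

    INSERT INTO tdata_dummy
    SELECT g,
           CASE
               WHEN g % 2274 = 0 THEN NULL           -- ~240 NULLs
               WHEN g % 12   = 0 THEN (g % 20) + 1   -- minority spread over 1..20
               ELSE 18                               -- the dominant value
           END
    FROM generate_series(1, 545768) AS g;

    CREATE INDEX tdata_dummy_project_id_idx ON tdata_dummy (project_id);
    ANALYZE tdata_dummy;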
John Meinel wrote:
>
> So notice that when doing the actual select it is able to do the index
> query. But for some reason with a prepared statement, it is not able to
> do it.
>
> Any ideas?

In the index-using example, PG knows the value you are comparing to, so
it can make a better estimate of how many rows will be returned. With
the prepared/compiled version it has to come up with a plan that makes
sense for any value.

If you look back at the explain output you'll see PG is guessing 181,923
rows will match with the prepared query but only 1 for the second query.
If you did in fact return that many rows, you wouldn't want to use the
index, since it would mean fetching values twice.

The only work-around if you are using plpgsql functions is to use
EXECUTE to make sure your queries are planned for each value provided.

--
  Richard Huxton
  Archonet Ltd
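[A minimal sketch of the EXECUTE workaround Richard mentions, written
against John's tdata table. The function name is hypothetical, and the
body uses the dollar-quoting available in the 8.0 beta.]

    CREATE OR REPLACE FUNCTION get_ids_for_project(int)
    RETURNS SETOF integer AS $$
    DECLARE
        rec RECORD;
    BEGIN
        -- Building the query string at run time forces a fresh plan on
        -- every call, so the planner sees the actual value instead of
        -- planning once for "any value".
        FOR rec IN EXECUTE
            'SELECT id FROM tdata WHERE project_id = ' || quote_literal($1)
        LOOP
            RETURN NEXT rec.id;
        END LOOP;
        RETURN;
    END;
    $$ LANGUAGE plpgsql;

    -- SELECT * FROM get_ids_for_project(30000); should now be able to
    -- pick the index scan for rare values.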
Richard Huxton wrote:
> John Meinel wrote:
>
>> So notice that when doing the actual select it is able to do the index
>> query. But for some reason with a prepared statement, it is not able
>> to do it.
>>
>> Any ideas?
>
> In the index-using example, PG knows the value you are comparing to, so
> it can make a better estimate of how many rows will be returned. With
> the prepared/compiled version it has to come up with a plan that makes
> sense for any value.
>
> If you look back at the explain output you'll see PG is guessing 181,923
> rows will match with the prepared query but only 1 for the second query.
> If you did in fact return that many rows, you wouldn't want to use the
> index, since it would mean fetching values twice.
>
> The only work-around if you are using plpgsql functions is to use
> EXECUTE to make sure your queries are planned for each value provided.

I suppose that makes sense. If the number were small (< 100), there
probably would be a lot of matches, because the tproject table is all
small integers. But for a large number, it probably doesn't exist in
that table at all.

Thanks for the heads up.

John
=:->
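[That intuition is easy to check directly; a quick sketch against John's
table:]

    -- How skewed is project_id, really? One row per distinct value,
    -- most common first.
    SELECT project_id, count(*)
    FROM tdata
    GROUP BY project_id
    ORDER BY count(*) DESC;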
[ enlarging on Richard's response a bit ]

John Meinel <john@johnmeinel.com> writes:
> jfmeinel=> explain analyze execute myget(30000);
>                              QUERY PLAN
> --------------------------------------------------------------------
>  Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
>      (actual time=1047.000..1047.000 rows=0 loops=1)
>    Filter: (project_id = $1)
>  Total runtime: 1047.000 ms

> jfmeinel=> explain analyze select id from tdata where project_id = 30000;
>                              QUERY PLAN
> -------------------------------------------------------------------------
>  Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
>      rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
>    Index Cond: (project_id = 30000)
>  Total runtime: 0.000 ms

> So notice that when doing the actual select it is able to do the index
> query. But for some reason with a prepared statement, it is not able to
> do it.

This isn't a "can't do it" situation, it's a "doesn't want to do it"
situation, and it's got nothing whatever to do with null or not null.
The issue is the estimated row count, which in the first case is so high
as to make the seqscan approach look cheaper. So the real question here
is what are the statistics on the column that are making the planner
guess such a large number when it has no specific information about the
compared-to value. Do you have one extremely common value in the column?
Have you done an ANALYZE recently on the table, and if so can you show
us the pg_stats row for the column?

			regards, tom lane
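[For reference, the pg_stats row Tom asks about can be pulled with a
query along these lines:]

    -- Refresh the statistics, then look at what the planner knows
    -- about the column.
    ANALYZE tdata;

    SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'tdata'
      AND attname = 'project_id';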
Tom Lane wrote:
> [ enlarging on Richard's response a bit ]
>
> John Meinel <john@johnmeinel.com> writes:
>
>> jfmeinel=> explain analyze execute myget(30000);
>>                              QUERY PLAN
>> --------------------------------------------------------------------
>>  Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
>>      (actual time=1047.000..1047.000 rows=0 loops=1)
>>    Filter: (project_id = $1)
>>  Total runtime: 1047.000 ms
>
>> jfmeinel=> explain analyze select id from tdata where project_id = 30000;
>>                              QUERY PLAN
>> -------------------------------------------------------------------------
>>  Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
>>      rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
>>    Index Cond: (project_id = 30000)
>>  Total runtime: 0.000 ms
>
>> So notice that when doing the actual select it is able to do the index
>> query. But for some reason with a prepared statement, it is not able to
>> do it.
>
> This isn't a "can't do it" situation, it's a "doesn't want to do it"
> situation, and it's got nothing whatever to do with null or not null.
> The issue is the estimated row count, which in the first case is so high
> as to make the seqscan approach look cheaper. So the real question here
> is what are the statistics on the column that are making the planner
> guess such a large number when it has no specific information about the
> compared-to value. Do you have one extremely common value in the column?
> Have you done an ANALYZE recently on the table, and if so can you show
> us the pg_stats row for the column?
>
> 			regards, tom lane

The answer is "yes": that particular column has very common numbers in
it. project_id is a number from 1 to 21.

I ended up modifying my query so that I do the bulk of the work in a
regular UNION SELECT, where everything can be optimized, and then I do a
separate query for this row inside an 'EXECUTE ...', so that unless I'm
actually requesting a small number, the query planner can notice that it
can do an indexed query.

I'm pretty sure this is just avoiding the worst-case scenario. It is
true that if I use the number 18, it will return 500,000 rows, and
getting those with an indexed lookup would be very bad. But typically
I'm asking for numbers in a very different range, so the planner is able
to know that it is not likely to find that number.

Thanks for pointing out what the query planner was thinking; I was able
to work around it.

John
=:->
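[As a closing illustration, the trade-off John describes can be seen by
comparing the prepared plan at the two extremes he mentions: 18, the
dominant value, and 30000, which does not occur. A sketch:]

    PREPARE myget(int) AS SELECT id FROM tdata WHERE project_id = $1;

    -- For the dominant value the generic seqscan plan is actually right:
    EXPLAIN ANALYZE EXECUTE myget(18);

    -- For an absent value an index scan would win, but the generic plan
    -- has no way to know that at PREPARE time:
    EXPLAIN ANALYZE EXECUTE myget(30000);

    DEALLOCATE myget;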