Thread: SQL code runs slower as a stored function
Hi all,
It's my first time posting to the list, though it's helped me through numerous jams throughout the past year as I've been familiarizing myself with all of the wonderful postgresisms =)
Just yesterday, I made what I thought would be a very minor and routine modification to a stored function. The function aggregates over several tables-worth of network packet data and generates some very simple statistics on packet loss (packets sent vs packets received). Initially, the function ran over an entire set of such data, but then I wanted to change it to only aggregate only on data within a specific time window, as necessary.
So I added 2 timestamp parameters to the function (start time and stop time) and utilized them in an appropriate WHERE clause to limit the dataset being aggregated, and suddenly the function takes much longer to run. For example, running the original un-windowed function over ~2hrs worth of packet data took a steady ~90ms. Running the new windowed version over the same data now takes ~15s. The result sets are identical.
So, baffled, I decided to take the slow-running code out of the stored function and run it as a straight sql query, replacing all parameter variables with static values. To my surprise, it only took ~90ms again. The result sets were all still identical.
Can anyone lend a guess as to what I'm running into here, or do I need to provide more specifics to recreate the issue? It's repeatable, but it's a fair bit of data for me to just post in here as-is. I've already discovered a few creative workarounds (e.g. plpgsql: return query execute ...) that make it run faster again, but at this point, I'm really more interested in finding out what would make sql code run so much slower as a stored function in the first place.
Thanks!
sg
It's my first time posting to the list, though it's helped me through numerous jams throughout the past year as I've been familiarizing myself with all of the wonderful postgresisms =)
Just yesterday, I made what I thought would be a very minor and routine modification to a stored function. The function aggregates over several tables-worth of network packet data and generates some very simple statistics on packet loss (packets sent vs packets received). Initially, the function ran over an entire set of such data, but then I wanted to change it to only aggregate only on data within a specific time window, as necessary.
So I added 2 timestamp parameters to the function (start time and stop time) and utilized them in an appropriate WHERE clause to limit the dataset being aggregated, and suddenly the function takes much longer to run. For example, running the original un-windowed function over ~2hrs worth of packet data took a steady ~90ms. Running the new windowed version over the same data now takes ~15s. The result sets are identical.
So, baffled, I decided to take the slow-running code out of the stored function and run it as a straight sql query, replacing all parameter variables with static values. To my surprise, it only took ~90ms again. The result sets were all still identical.
Can anyone lend a guess as to what I'm running into here, or do I need to provide more specifics to recreate the issue? It's repeatable, but it's a fair bit of data for me to just post in here as-is. I've already discovered a few creative workarounds (e.g. plpgsql: return query execute ...) that make it run faster again, but at this point, I'm really more interested in finding out what would make sql code run so much slower as a stored function in the first place.
Thanks!
sg
On 13/05/2010 16:59, S G wrote: > Can anyone lend a guess as to what I'm running into here, or do I need > to provide more specifics to recreate the issue? It's repeatable, but > it's a fair bit of data for me to just post in here as-is. I've already > discovered a few creative workarounds (e.g. plpgsql: return query > execute ...) that make it run faster again, but at this point, I'm > really more interested in finding out what would make sql code run so > much slower as a stored function in the first place. It would help if you post the output from EXPLAIN ANALYZE for each case, as this will hopefully show where the bottleneck is. Ray. -- Raymond O'Donnell :: Galway :: Ireland rod@iol.ie
S G, * S G (sgennaria2@gmail.com) wrote: > Can anyone lend a guess as to what I'm running into here, or do I need to > provide more specifics to recreate the issue? It's repeatable, but it's a > fair bit of data for me to just post in here as-is. I've already discovered > a few creative workarounds (e.g. plpgsql: return query execute ...) that > make it run faster again, but at this point, I'm really more interested in > finding out what would make sql code run so much slower as a stored function > in the first place. The planner doesn't know what values those variables can take when you're passing them into a function. Therefore, it tries to come up with a plan which will work in the 'general' case, and then it will store that plan and reuse it. When there are static values, it can construct a better plan. If you're using partitioning on the table, that can mean the difference between a plan that has to scan all parts of the table and a plan that only has to scan the one part of the table that matches the constraint. Using 'execute' will cause the planner to re-plan the query every time using the static values that you've put into the query. Thanks, Stephen
Attachment
Thanks Stephen, that makes a lot of sense based on some research I was doing yesterday.
I'm still a bit green to understand the inner workings of the planner and all the stuff coming out of EXPLAIN, but I'm definitely trying to keep it all in mind as I progress at this stuff.
I guess to really get down to the issue, I'm curious if what I'm doing is considered 'standard procedure' to others-- i.e. using funny workarounds like building the query in a text var and executing it with plpgsql's RETURN QUERY EXECUTE command. Are there other schools of thought on how to approach something like this? Could it be more of a sign that my design is flawed? Something in me just feels like there should be a better way to approach this. Otherwise I feel like I'm just blaming postgres for the problem, which I'm not so sure I want to do.
Also re: Raymond's request, I tried humoring myself with the EXPLAIN output, and I actually don't see anything useful-looking at all when I run it on a stored function like I'm using. Is there any way to utilize EXPLAIN on a query embedded in a stored function? I could run it just fine on the raw sql, but the raw sql wasn't what was running slow, so I'm not sure if it's even helpful to do that.
Thanks!
On Thu, May 13, 2010 at 12:06 PM, Stephen Frost <sfrost@snowman.net> wrote:
I'm still a bit green to understand the inner workings of the planner and all the stuff coming out of EXPLAIN, but I'm definitely trying to keep it all in mind as I progress at this stuff.
I guess to really get down to the issue, I'm curious if what I'm doing is considered 'standard procedure' to others-- i.e. using funny workarounds like building the query in a text var and executing it with plpgsql's RETURN QUERY EXECUTE command. Are there other schools of thought on how to approach something like this? Could it be more of a sign that my design is flawed? Something in me just feels like there should be a better way to approach this. Otherwise I feel like I'm just blaming postgres for the problem, which I'm not so sure I want to do.
Also re: Raymond's request, I tried humoring myself with the EXPLAIN output, and I actually don't see anything useful-looking at all when I run it on a stored function like I'm using. Is there any way to utilize EXPLAIN on a query embedded in a stored function? I could run it just fine on the raw sql, but the raw sql wasn't what was running slow, so I'm not sure if it's even helpful to do that.
Thanks!
On Thu, May 13, 2010 at 12:06 PM, Stephen Frost <sfrost@snowman.net> wrote:
S G,The planner doesn't know what values those variables can take when
* S G (sgennaria2@gmail.com) wrote:
> Can anyone lend a guess as to what I'm running into here, or do I need to
> provide more specifics to recreate the issue? It's repeatable, but it's a
> fair bit of data for me to just post in here as-is. I've already discovered
> a few creative workarounds (e.g. plpgsql: return query execute ...) that
> make it run faster again, but at this point, I'm really more interested in
> finding out what would make sql code run so much slower as a stored function
> in the first place.
you're passing them into a function. Therefore, it tries to come up
with a plan which will work in the 'general' case, and then it will
store that plan and reuse it. When there are static values, it can
construct a better plan. If you're using partitioning on the table,
that can mean the difference between a plan that has to scan all parts
of the table and a plan that only has to scan the one part of the table
that matches the constraint. Using 'execute' will cause the planner to
re-plan the query every time using the static values that you've put
into the query.
Thanks,
Stephen
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
iEYEARECAAYFAkvsI5UACgkQrzgMPqB3kiilsQCeMsPGf4uTCLV5uhjfQBdhewCr
XHEAnA99iWd71pUCWv4A0cVPZ2NMvdp6
=YNPE
-----END PGP SIGNATURE-----
-----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 > Also re: Raymond's request, I tried humoring myself with the EXPLAIN output, > and I actually don't see anything useful-looking at all when I run it on a > stored function like I'm using. Is there any way to utilize EXPLAIN on a > query embedded in a stored function? I could run it just fine on the raw > sql, but the raw sql wasn't what was running slow, so I'm not sure if it's > even helpful to do that. I wrote a blog article about this: Google for "Why is my function slow?" - -- Greg Sabino Mullane greg@turnstep.com End Point Corporation http://www.endpoint.com/ PGP Key: 0x14964AC8 201005131251 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkvsLjQACgkQvJuQZxSWSsjgzwCfSxAeq4m6KUInlYSoBQECMASe 1psAoLJ8zrzrR4O9V0UZdOr0YHjCbxXj =80ur -----END PGP SIGNATURE-----
S G, * S G (sgennaria2@gmail.com) wrote: > I guess to really get down to the issue, I'm curious if what I'm doing is > considered 'standard procedure' to others-- i.e. using funny workarounds > like building the query in a text var and executing it with plpgsql's RETURN > QUERY EXECUTE command. It's certainly not unusual when it's necessary. We do try to minimize the times it's necessary by making the planner as smart as possible. > Are there other schools of thought on how to > approach something like this? Could it be more of a sign that my design is > flawed? Something in me just feels like there should be a better way to > approach this. Otherwise I feel like I'm just blaming postgres for the > problem, which I'm not so sure I want to do. I don't think needing to do this would imply that the design is flawed.. > Also re: Raymond's request, I tried humoring myself with the EXPLAIN output, > and I actually don't see anything useful-looking at all when I run it on a > stored function like I'm using. Is there any way to utilize EXPLAIN on a > query embedded in a stored function? I could run it just fine on the raw > sql, but the raw sql wasn't what was running slow, so I'm not sure if it's > even helpful to do that. You can try putting the explain *into* the stored procedure, and then putting its results into a text variable and then spitting that back out. You can also play with setting it up as a prepared statement and then running explain/explain analyze on that; eg: prepare x as select * from table where column1 = $1; explain execute x('2010-01-01'); That would hopefully give you the same plan as what's happening in the stored procedure. Thanks, Stephen
Attachment
I wrote a blog article about this: Google for "Why is my function slow?"
Greg, great article. It's funny and humbling how I thought I'd invented the wheel with this workaround, but now it suddenly appears to be common practice. I guess I'm on the right track.
I'm curious though-- your examples were plpgsql and you mentioned at the bottom that the plpgsql documentation mentions the one-time plan limitation. Does the same hold true for stored functions in sql? My slow function in this case is written in sql, so I'm guessing the answer is 'yes'.
It would help if you post the output from EXPLAIN ANALYZE for each case,
as this will hopefully show where the bottleneck is.
I'm glad you asked, and I think I've discussed enough about what *could-be* happening that I'd like to get my hands dirty with what's really going on. Maybe now I can learn how to read and interpret all this stuff. I will present 3 EXPLAIN ANALYZE cases:
1) slow with variables: equal query structure as example 2
2) fast with static values: equal query structure as example 1
3) fast with variables: I changed where I put the constraint and it allowed me to keep this as a normal sql stored procedure without relying on the plpgsql RETURN QUERY EXECUTE fix
1) slow with variables: equal query structure as example 2
' CTE subquery2'
' -> HashAggregate (cost=4966.54..4966.56 rows=1 width=44) (actual time=5616.178..5616.179 rows=1 loops=1)'
' CTE subquery'
' -> Sort (cost=4966.51..4966.51 rows=1 width=346) (actual time=5612.507..5612.726 rows=1810 loops=1)'
' Sort Key: tb_blackhawk_sent.sent_date_time'
' Sort Method: quicksort Memory: 870kB'
' -> Nested Loop Left Join (cost=813.78..4966.50 rows=1 width=346) (actual time=10.012..5608.011 rows=1810 loops=1)'
' Join Filter: (tb_blackhawk_sent.pcap_sent_date_time = tb_blackhawk.pcap_sent_date_time)'
' -> Nested Loop (cost=813.78..3367.26 rows=1 width=229) (actual time=4.698..24.509 rows=1810 loops=1)'
' Join Filter: ((tb_blackhawk_sent.pcap_sent_date_time >= sf_group_join_times_for_collector_in_test.join_time) AND (tb_blackhawk_sent.pcap_sent_date_time <= sf_group_join_times_for_collector_in_test.stop_time))'
' -> Function Scan on sf_group_join_times_for_collector_in_test (cost=0.00..265.00 rows=1 width=92) (actual time=2.946..2.957 rows=2 loops=1)'
' Filter: (((group_ip)::text = $3) AND (listen_port = $4))'
' -> Bitmap Heap Scan on tb_blackhawk_sent (cost=813.78..3102.11 rows=10 width=137) (actual time=1.661..9.843 rows=1862 loops=2)'
' Recheck Cond: (tb_blackhawk_sent.test_id = $1)'
' Filter: ((tb_blackhawk_sent.pcap_sent_date_time >= $5) AND (tb_blackhawk_sent.pcap_sent_date_time <= $6) AND ((tb_blackhawk_sent.destination_ip)::text = $3) AND (tb_blackhawk_sent.destination_port = $4))'
' -> Bitmap Index Scan on tb_blackhawk_sent_pk2 (cost=0.00..813.77 rows=13526 width=0) (actual time=1.625..1.625 rows=13031 loops=2)'
' Index Cond: (tb_blackhawk_sent.test_id = $1)'
' -> Index Scan using tb_blackhawk_pk2 on tb_blackhawk (cost=0.00..1599.22 rows=1 width=117) (actual time=0.678..3.083 rows=1 loops=1810)'
' Index Cond: ((tb_blackhawk.test_id = $1) AND (tb_blackhawk_sent.sequence_num = tb_blackhawk.sequence_num) AND ((tb_blackhawk.collector_ip)::text = ($2)::text) AND ((tb_blackhawk_sent.collector_ip)::text = (tb_blackhawk.source_ip)::text))'
' Filter: (((tb_blackhawk.destination_ip)::text = $3) AND (tb_blackhawk.destination_port = $4))'
' -> CTE Scan on subquery (cost=0.00..0.02 rows=1 width=44) (actual time=5612.513..5615.248 rows=1810 loops=1)'
'Total runtime: 5616.395 ms'
2) fast with static values: equal query structure as example 1
'CTE Scan on subquery2 (cost=14396.95..14403.95 rows=200 width=88) (actual time=46.243..46.244 rows=1 loops=1)'
' CTE subquery2'
' -> HashAggregate (cost=14393.95..14396.95 rows=200 width=44) (actual time=46.222..46.223 rows=1 loops=1)'
' CTE subquery'
' -> Sort (cost=14353.35..14356.25 rows=1160 width=346) (actual time=42.338..42.532 rows=1810 loops=1)'
' Sort Key: tb_blackhawk_sent.sent_date_time'
' Sort Method: quicksort Memory: 870kB'
' -> Hash Left Join (cost=11315.16..14294.31 rows=1160 width=346) (actual time=19.476..38.382 rows=1810 loops=1)'
' Hash Cond: (((tb_blackhawk_sent.collector_ip)::text = (tb_blackhawk.source_ip)::text) AND (tb_blackhawk_sent.sequence_num = tb_blackhawk.sequence_num) AND (tb_blackhawk_sent.pcap_sent_date_time = tb_blackhawk.pcap_sent_date_time))'
' -> Nested Loop (cost=789.32..3490.06 rows=1160 width=229) (actual time=4.954..21.640 rows=1810 loops=1)'
' Join Filter: ((tb_blackhawk_sent.pcap_sent_date_time >= sf_group_join_times_for_collector_in_test.join_time) AND (tb_blackhawk_sent.pcap_sent_date_time <= sf_group_join_times_for_collector_in_test.stop_time))'
' -> Function Scan on sf_group_join_times_for_collector_in_test (cost=0.00..265.00 rows=1 width=92) (actual time=2.992..2.998 rows=2 loops=1)'
' Filter: (((group_ip)::text = '239.0.0.1'::text) AND (listen_port = 5001))'
' -> Bitmap Heap Scan on tb_blackhawk_sent (cost=789.32..3068.48 rows=10439 width=137) (actual time=1.762..8.630 rows=1862 loops=2)'
' Recheck Cond: (tb_blackhawk_sent.test_id = 219)'
' Filter: ((tb_blackhawk_sent.pcap_sent_date_time >= '-infinity'::timestamp without time zone) AND (tb_blackhawk_sent.pcap_sent_date_time <= 'infinity'::timestamp without time zone) AND ((tb_blackhawk_sent.destination_ip)::text = '239.0.0.1'::text) AND (tb_blackhawk_sent.destination_port = 5001))'
' -> Bitmap Index Scan on tb_blackhawk_sent_pk2 (cost=0.00..786.71 rows=13118 width=0) (actual time=1.730..1.730 rows=13031 loops=2)'
' Index Cond: (tb_blackhawk_sent.test_id = 219)'
' -> Hash (cost=10451.04..10451.04 rows=4274 width=117) (actual time=14.508..14.508 rows=1361 loops=1)'
' -> Bitmap Heap Scan on tb_blackhawk (cost=3179.36..10451.04 rows=4274 width=117) (actual time=9.792..13.853 rows=1361 loops=1)'
' Recheck Cond: ((test_id = 219) AND ((collector_ip)::text = '192.168.145.2'::text))'
' Filter: (((destination_ip)::text = '239.0.0.1'::text) AND (destination_port = 5001))'
' -> Bitmap Index Scan on tb_blackhawk_pk2 (cost=0.00..3178.29 rows=5725 width=0) (actual time=9.753..9.753 rows=9437 loops=1)'
' Index Cond: ((test_id = 219) AND ((collector_ip)::text = '192.168.145.2'::text))'
' -> CTE Scan on subquery (cost=0.00..23.20 rows=1160 width=44) (actual time=42.343..45.235 rows=1810 loops=1)'
'Total runtime: 46.436 ms'
3) fast with variables: I changed where I put the constraint and it allowed me to keep this as a normal sql stored procedure without relying on the plpgsql RETURN QUERY EXECUTE fix
'CTE Scan on subquery2 (cost=11879.96..11880.00 rows=1 width=88) (actual time=39.888..39.889 rows=1 loops=1)'
' CTE subquery2'
' -> HashAggregate (cost=11879.95..11879.96 rows=1 width=44) (actual time=39.867..39.868 rows=1 loops=1)'
' CTE subquery'
' -> Sort (cost=11874.02..11874.56 rows=215 width=346) (actual time=35.562..35.729 rows=1810 loops=1)'
' Sort Key: tb_blackhawk_sent.sent_date_time'
' Sort Method: quicksort Memory: 870kB'
' -> Hash Left Join (cost=9309.88..11865.69 rows=215 width=346) (actual time=18.192..31.508 rows=1810 loops=1)'
' Hash Cond: (((tb_blackhawk_sent.collector_ip)::text = (tb_blackhawk.source_ip)::text) AND (tb_blackhawk_sent.sequence_num = tb_blackhawk.sequence_num) AND (tb_blackhawk_sent.pcap_sent_date_time = tb_blackhawk.pcap_sent_date_time))'
' -> Nested Loop (cost=814.26..3328.94 rows=215 width=229) (actual time=4.538..15.664 rows=1810 loops=1)'
' Join Filter: ((tb_blackhawk_sent.pcap_sent_date_time >= sf_group_join_times_for_collector_in_test.join_time) AND (tb_blackhawk_sent.pcap_sent_date_time <= sf_group_join_times_for_collector_in_test.stop_time))'
' -> Function Scan on sf_group_join_times_for_collector_in_test (cost=0.00..265.00 rows=1 width=92) (actual time=2.958..2.963 rows=2 loops=1)'
' Filter: (((group_ip)::text = $3) AND (listen_port = $4))'
' -> Bitmap Heap Scan on tb_blackhawk_sent (cost=814.26..3034.96 rows=1932 width=137) (actual time=1.573..5.569 rows=1862 loops=2)'
' Recheck Cond: (tb_blackhawk_sent.test_id = $1)'
' Filter: (((tb_blackhawk_sent.destination_ip)::text = $3) AND (tb_blackhawk_sent.destination_port = $4))'
' -> Bitmap Index Scan on tb_blackhawk_sent_pk2 (cost=0.00..813.77 rows=13526 width=0) (actual time=1.541..1.541 rows=13031 loops=2)'
' Index Cond: (tb_blackhawk_sent.test_id = $1)'
' -> Hash (cost=8485.14..8485.14 rows=599 width=117) (actual time=13.640..13.640 rows=1361 loops=1)'
' -> Bitmap Heap Scan on tb_blackhawk (cost=1924.02..8485.14 rows=599 width=117) (actual time=9.240..13.035 rows=1361 loops=1)'
' Recheck Cond: ((test_id = $1) AND ((collector_ip)::text = ($2)::text))'
' Filter: (((destination_ip)::text = $3) AND (destination_port = $4))'
' -> Bitmap Index Scan on tb_blackhawk_pk2 (cost=0.00..1923.87 rows=4193 width=0) (actual time=9.201..9.201 rows=9437 loops=1)'
' Index Cond: ((test_id = $1) AND ((collector_ip)::text = ($2)::text))'
' -> CTE Scan on subquery (cost=0.00..5.38 rows=1 width=44) (actual time=35.571..38.811 rows=1810 loops=1)'
' Filter: ((sent_date_time_a >= $5) AND (sent_date_time_a <= $6))'
'Total runtime: 40.092 ms'
Hope you can read that better than I can!
Thanks!
Stephen Frost <sfrost@snowman.net> writes: > * S G (sgennaria2@gmail.com) wrote: >> I guess to really get down to the issue, I'm curious if what I'm doing is >> considered 'standard procedure' to others-- i.e. using funny workarounds >> like building the query in a text var and executing it with plpgsql's RETURN >> QUERY EXECUTE command. > It's certainly not unusual when it's necessary. We do try to minimize > the times it's necessary by making the planner as smart as possible. Yeah. This is a known sore spot that I hope to do something about in the 9.1 development cycle. regards, tom lane