Thread: what's going on here?
We have two postgres-driven web pages, and we've noticed in the last few
days that one is loading about 10 times faster than the other. Looking into
it, we found that postgres has very different execution paths for what is
essentially the same query, and this makes no sense to me. Sorry for the
deluge of info here, but maybe it'll explain why this is happening....

First, here are the explains. Note the only parameter that changes is
playlist.stream, and yet for some reason the second query makes use of
indices which are unrelated to playlist.stream.

music=# explain
select users.name as username, playlist.id as id, paths.len as len,
       paths.path as path, timestamp(playlist.added) as added,
       timestamp(playlist.played) as played
from users, playlist, paths, files
where paths.md5=files.md5 and files.song=playlist.songid
  and playlist.userid=users.id and playlist.played is null
  and playlist.stream=1
order by playlist.added, playlist.id;
NOTICE:  QUERY PLAN:

Sort  (cost=10845.28..10845.28 rows=1857 width=80)
  ->  Hash Join  (cost=10007.93..10744.44 rows=1857 width=80)
        ->  Merge Join  (cost=10006.44..10668.17 rows=1857 width=64)
              ->  Sort  (cost=3754.03..3754.03 rows=2321 width=36)
                    ->  Hash Join  (cost=306.62..3624.28 rows=2321 width=36)
                          ->  Seq Scan on files  (cost=0.00..1072.13 rows=55413 width=16)
                          ->  Hash  (cost=300.81..300.81 rows=2321 width=20)
                                ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)
              ->  Sort  (cost=6252.41..6252.41 rows=50617 width=28)
                    ->  Seq Scan on paths  (cost=0.00..1677.17 rows=50617 width=28)
        ->  Hash  (cost=1.39..1.39 rows=39 width=16)
              ->  Seq Scan on users  (cost=0.00..1.39 rows=39 width=16)

EXPLAIN
music=# explain
select users.name as username, playlist.id as id, paths.len as len,
       paths.path as path, timestamp(playlist.added) as added,
       timestamp(playlist.played) as played
from users, playlist, paths, files
where paths.md5=files.md5 and files.song=playlist.songid
  and playlist.userid=users.id and playlist.played is null
  and playlist.stream=2
order by playlist.added, playlist.id;
NOTICE:  QUERY PLAN:

Sort  (cost=1467.44..1467.44 rows=164 width=80)
  ->  Hash Join  (cost=1.49..1461.40 rows=164 width=80)
        ->  Nested Loop  (cost=0.00..1452.85 rows=164 width=64)
              ->  Nested Loop  (cost=0.00..771.76 rows=205 width=36)
                    ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)
                    ->  Index Scan using files_song_btree_key on files  (cost=0.00..2.28 rows=1 width=16)
              ->  Index Scan using paths_md5_btree_key on paths  (cost=0.00..3.31 rows=1 width=28)
        ->  Hash  (cost=1.39..1.39 rows=39 width=16)
              ->  Seq Scan on users  (cost=0.00..1.39 rows=39 width=16)

EXPLAIN

Okay, so now the relevant schema:

            Table "users"
 Attribute |  Type   |                    Modifier
-----------+---------+------------------------------------------------
 id        | integer | not null default nextval('users_id_seq'::text)
 name      | text    | not null
Indices: users_name_key,
         users_pkey

Index "users_name_key"
 Attribute | Type
-----------+------
 name      | text
unique btree

Index "users_pkey"
 Attribute |  Type
-----------+---------
 id        | integer
unique btree (primary key)

              Table "playlist"
 Attribute |  Type   |                      Modifier
-----------+---------+---------------------------------------------------
 id        | integer | not null default nextval('playlist_id_seq'::text)
 songid    | integer |
 userid    | integer |
 added     | integer |
 played    | integer |
 stream    | integer |
Indices: playlist_added_key,
         playlist_pkey,
         playlist_played_key,
         playlist_songid_key

Index "playlist_added_key"
 Attribute |  Type
-----------+---------
 added     | integer
btree

Index "playlist_pkey"
 Attribute |  Type
-----------+---------
 id        | integer
unique btree (primary key)

Index "playlist_played_key"
 Attribute |  Type
-----------+---------
 played    | integer
btree

Index "playlist_songid_key"
 Attribute |  Type
-----------+---------
 songid    | integer
btree

       Table "paths"
 Attribute |   Type   | Modifier
-----------+----------+----------
 md5       | char(32) | not null
 path      | text     | not null
 len       | integer  |
Indices: paths_md5_btree_key,
         paths_md5_key,
         paths_path_key

Index "paths_md5_btree_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
btree

Index "paths_md5_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
hash

Index "paths_path_key"
 Attribute | Type
-----------+------
 path      | text
btree

       Table "files"
 Attribute |   Type   | Modifier
-----------+----------+----------
 song      | integer  | not null
 md5       | char(32) | not null
Indices: files_md5_btree_key,
         files_md5_key,
         files_song_btree_key,
         files_song_key

Index "files_md5_btree_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
btree

Index "files_md5_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
hash

Index "files_song_btree_key"
 Attribute |  Type
-----------+---------
 song      | integer
btree

Index "files_song_key"
 Attribute |  Type
-----------+---------
 song      | integer
hash

Interestingly, the sequential scan on playlist claims to be returning
2000 results for stream=1 and only 200 for stream=2. I'm not sure which
part of the where clause this guess comes from, because the playlist
table has roughly equal numbers of entries for both streams.
Has vacuum analyze been run on both recently? What is the maximum number
of rows with a particular stream value, and how many does each of 1 and 2
actually have?

> Interestingly, the sequential scan on playlist claims to be returning 2000
> results for stream=1 and only 200 for stream=2. I'm not sure which part of
> the where clause this guess comes from, because the playlist table has
> roughly equal numbers of entries for both streams.
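P.S. If in doubt, it's cheap to redo by hand before comparing the plans,
e.g.:

  vacuum analyze playlist;
  vacuum analyze files;
  vacuum analyze paths;
  vacuum analyze users;

(VACUUM ANALYZE is what refreshes the per-column statistics that the
planner bases these row estimates on.)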
Every night. There are 6223 rows with stream=2 and 7041 rows with
stream=1. At any given time, there will be between 1 and 30 rows with
played=null for both values.

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> Has vacuum analyze been run on both recently? What is the maximum number
> of rows with a particular stream value, and how many does each of 1 and 2
> actually have?
>
> > Interestingly, the sequential scan on playlist claims to be returning 2000
> > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > the where clause this guess comes from, because the playlist table has
> > roughly equal numbers of entries for both streams.
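P.S. Those numbers come from nothing fancier than plain counts, e.g.:

  select stream, count(*) from playlist group by stream;

  select stream, count(*)
  from playlist
  where played is null
  group by stream;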
Hmm, I also notice that it's getting very different row estimates for
files as well.

On Fri, 9 Mar 2001, Ben wrote:

> Every night. There are 6223 rows with stream=2 and 7041 rows with
> stream=1. At any given time, there will be between 1 and 30 rows with
> played=null for both values.
>
> On Fri, 9 Mar 2001, Stephan Szabo wrote:
>
> > Has vacuum analyze been run on both recently? What is the maximum number
> > of rows with a particular stream value, and how many does each of 1 and 2
> > actually have?
> >
> > > Interestingly, the sequential scan on playlist claims to be returning 2000
> > > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > > the where clause this guess comes from, because the playlist table has
> > > roughly equal numbers of entries for both streams.
I think that's because the stream=1 query is doing a sequential scan on
files, so it has to look at everything, whereas the index scan in the
stream=2 query only looks at the rows that matter. (Which in this case
was just 1 row.) I think the real question is: why is one using an index
scan and the other a sequential scan?

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> Hmm, I also notice that it's getting very different row estimates for
> files as well.
>
> On Fri, 9 Mar 2001, Ben wrote:
>
> > Every night. There are 6223 rows with stream=2 and 7041 rows with
> > stream=1. At any given time, there will be between 1 and 30 rows with
> > played=null for both values.
> >
> > On Fri, 9 Mar 2001, Stephan Szabo wrote:
> >
> > > Has vacuum analyze been run on both recently? What is the maximum number
> > > of rows with a particular stream value, and how many does each of 1 and 2
> > > actually have?
> > >
> > > > Interestingly, the sequential scan on playlist claims to be returning 2000
> > > > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > > > the where clause this guess comes from, because the playlist table has
> > > > roughly equal numbers of entries for both streams.
Ben <bench@silentmedia.com> writes:
> music=# explain ... where ... playlist.stream=1
>         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)

> music=# explain ... where ... playlist.stream=2
>         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)

I am betting that 1 is the most common value in playlist.stream, or at
least is being chosen as the most common value by VACUUM ANALYZE's
not-totally-accurate estimation process. The 2321 rowcount estimate then
falls out of the stored statistic for the most common value's frequency.
In the second case, the estimator knows that 2 is *not* the most common
value, but it has absolutely no statistical basis on which to guess what
the frequency really is. I think it uses 1/10th of the most common
frequency for anything that's not the most common value (look in
utils/adt/selfuncs.c to be sure). There's probably also some contribution
from the "playlist.played is null" clause, else the row count estimate
would be exactly 1/10th as much. However, I don't believe that the thing
currently makes any serious effort to gauge the selectivity of IS NULL,
which is a shame because that would critically affect the results here.
(You did say some thousands of rows matching the stream=N clause, but
only a few matching IS NULL, right?)

Given the fundamental difference in this initial row count estimate, the
large difference in the subsequent join plan structure is not too
surprising.

In short: another demonstration of the limitations of our current
statistics about data frequencies.

BTW, you didn't actually say which plan was faster. Since the second one
was closer to the true statistic (only a few rows returned from
playlist), I'm hoping it was faster...

			regards, tom lane
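P.S. If you want to see the stored guess for yourself, it lives in
pg_statistic. Something along these lines should work against the
7.0-era catalogs; I'm writing the column names from memory, so treat it
as a sketch:

  select a.attname, s.stacommonval, s.stacommonfrac
  from pg_statistic s, pg_class c, pg_attribute a
  where c.relname = 'playlist'      -- the table of interest
    and s.starelid = c.oid          -- statistics rows for that table
    and a.attrelid = c.oid
    and a.attnum = s.staattnum      -- match statistics to the column
    and a.attname = 'stream';       -- the column of interest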
On Fri, 9 Mar 2001, Tom Lane wrote:

> Ben <bench@silentmedia.com> writes:
> > music=# explain ... where ... playlist.stream=1
> >         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)
>
> > music=# explain ... where ... playlist.stream=2
> >         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)
>
> I am betting that 1 is the most common value in playlist.stream, or at
> least is being chosen as the most common value by VACUUM ANALYZE's
> not-totally-accurate estimation process. The 2321 rowcount estimate
> then falls out of the stored statistic for the most common value's
> frequency. In the second case, the estimator knows that 2 is *not* the

Yes, stream=1 is the most common value, but 2321? There are ~7000 rows
where stream=1.

> most common value, but it has absolutely no statistical basis on which
> to guess what the frequency really is. I think it uses 1/10th of the
> most common frequency for anything that's not the most common value
> (look in utils/adt/selfuncs.c to be sure). There's probably also some
> contribution from the "playlist.played is null" clause, else the row
> count estimate would be exactly 1/10th as much. However, I don't
> believe that the thing currently makes any serious effort to gauge the
> selectivity of IS NULL, which is a shame because that would critically
> affect the results here. (You did say some thousands of rows matching
> the stream=N clause, but only a few matching IS NULL, right?)

This is not good news. Most of the reason I have an index on
playlist.played is to figure out which rows are null, to limit the
results this query has to look at. Is this (probably very hard to fix)
shortcoming addressed in version 7.1, by chance?

> Given the fundamental difference in this initial row count estimate,
> the large difference in the subsequent join plan structure is not too
> surprising.

This I don't follow. The stream=1 query is returning 10 times the rows -
shouldn't that make the use of an index even *more* likely?

> BTW, you didn't actually say which plan was faster. Since the second
> one was closer to the true statistic (only a few rows returned from
> playlist), I'm hoping it was faster...

Sorry. Yes, the second one was far faster, thanks to the indices it used.
> This I don't follow. The stream=1 query is returning 10 times the rows -
> shouldn't that make the use of an index even *more* likely?

No, because if you're getting a large % of the rows of the table, it'll
be faster to scan the heap sequentially than to visit it in random order
by following the index. (You have to go into the heap anyway to find out
whether or not each tuple is valid, and seeking all over the heap file
can end up being more expensive than reading it straight through.)

As a question, does an index on stream, played (both in the same index)
help at all?
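Concretely, something like this (the index name here is only
illustrative):

  -- one composite btree covering both restriction columns
  create index playlist_stream_played_key on playlist (stream, played);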
Ben <bench@silentmedia.com> writes:
>> I am betting that 1 is the most common value in playlist.stream, or at
>> least is being chosen as the most common value by VACUUM ANALYZE's
>> not-totally-accurate estimation process. The 2321 rowcount estimate
>> then falls out of the stored statistic for the most common value's
>> frequency. In the second case, the estimator knows that 2 is *not* the

> Yes, stream=1 is the most common value, but 2321? There are ~7000 rows
> where stream=1.

Well, I did say not totally accurate ;-). But [digs in code] there is a
hardwired selectivity of 0.333 for the IS NULL clause, so it's really
estimating about 6960 rows matching the stream=1 clause (2321 / 0.333 is
roughly 6970). That's closer than I'd have expected, actually, given the
inaccuracies in VACUUM's method of obtaining this estimate. So the real
problem here is the lack of estimation for IS NULL.

> This is not good news. Most of the reason I have an index on
> playlist.played is to figure out which rows are null, to limit the
> results this query has to look at.

We don't currently use indexes for IS NULL at all.

> Is this (probably very hard to fix) shortcoming addressed in version
> 7.1, by chance?

No. The fact that the planner doesn't derive an estimate could be fixed
with not too much work (but it won't happen for 7.1). Making IS NULL an
indexable condition would affect considerably more code, and it's not
really what's making the difference here anyway --- although it would
allow us to replace the seqscan on playlist with an indexscan, which'd
be even faster.

> This I don't follow. The stream=1 query is returning 10 times the rows -
> shouldn't that make the use of an index even *more* likely?

No. Your second query is using nested loop with inner indexscan joins.
That's fine for *small* numbers of rows returned by the outer subquery,
but it goes to hell in a handbasket for large numbers of rows. The
planner is doing the right thing to switch to a heavier-duty plan with
more startup overhead ... or it would be if it had the right idea about
how many rows are going to come out of playlist, that is.

The real issue here is the lack of estimation of IS NULL selectivity.
I was aware that that was broken, but I don't recall having seen a
real-life example before where it really mattered. I'll try to move it
up a notch or two on my to-do list. Or if someone else wants to dive in,
feel free...

			regards, tom lane
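P.S. You can read the tradeoff straight off the plan numbers. In the
stream=2 plan, the nested loop over files costs roughly the outer scan
plus one index probe per estimated outer row:

  outer seq scan on playlist              300.81
  205 inner probes at ~2.28 apiece        467.40
  ----------------------------------------------
  total                                   768.21   (plan says 771.76)

Multiply the outer row count by ten and the per-probe charges swamp the
plan; at that point the sort/merge/hash strategy of the first plan is
the cheaper bet.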
Unfortunately, no.

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> As a question, does an index on stream, played (both in the same
> index) help at all?
On Fri, 9 Mar 2001, Tom Lane wrote:

> No. The fact that the planner doesn't derive an estimate could be fixed
> with not too much work (but it won't happen for 7.1). Making IS NULL an
> indexable condition would affect considerably more code, and it's not
> really what's making the difference here anyway --- although it would
> allow us to replace the seqscan on playlist with an indexscan, which'd
> be even faster.

It would be nice if postgres could index on IS NULL, or if it were
documented that indexes don't get used there. Had I known, I would have
used played = 0 to mean the same thing that I'm trying to get out of
played IS NULL... doing that lets explain give a very accurate picture.

> No. Your second query is using nested loop with inner indexscan joins.
> That's fine for *small* numbers of rows returned by the outer subquery,
> but it goes to hell in a handbasket for large numbers of rows. The
> planner is doing the right thing to switch to a heavier-duty plan with
> more startup overhead ... or it would be if it had the right idea about
> how many rows are going to come out of playlist, that is.

Ahhh, so desu. Thanks for the explanations.

> The real issue here is the lack of estimation of IS NULL selectivity.
> I was aware that that was broken, but I don't recall having seen a
> real-life example before where it really mattered. I'll try to move
> it up a notch or two on my to-do list. Or if someone else wants to
> dive in, feel free...

Yeah, like I said above, being able to use an accurate index for
estimating the size of the inner loop makes quite a difference. But at
least in this case, I had a value I could use in place of NULL, so it's
hardly a high priority. Thanks again.
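P.S. For the archives, the workaround looks like this (an untested
sketch; 0 is just a sentinel meaning "not yet played"):

  -- backfill the sentinel so the equality test finds today's rows
  update playlist set played = 0 where played is null;

  select users.name as username, playlist.id as id, paths.len as len,
         paths.path as path, timestamp(playlist.added) as added,
         timestamp(playlist.played) as played
  from users, playlist, paths, files
  where paths.md5 = files.md5
    and files.song = playlist.songid
    and playlist.userid = users.id
    and playlist.played = 0        -- was: playlist.played is null
    and playlist.stream = 2
  order by playlist.added, playlist.id;

With an ordinary equality on played, the existing playlist_played_key
btree index becomes usable, and the planner can estimate the clause's
selectivity like any other.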