Thread: what's going on here?
We have two postgres-driven web pages, and we've noticed in the last few
days that one is loading about 10 times faster than the other. Looking into
it, we found that postgres has very different execution paths for what is
essentially the same query, and this makes no sense to me. Sorry for the
deluge of info here, but maybe it'll explain why this is happening....

First, here are the explains. Note the only parameter that changes is
playlist.stream, and yet for some reason the second query makes use of
indices which are unrelated to playlist.stream.

music=# explain
select users.name as username, playlist.id as id, paths.len as len,
       paths.path as path, timestamp(playlist.added) as added,
       timestamp(playlist.played) as played
from users, playlist, paths, files
where paths.md5=files.md5 and files.song=playlist.songid
  and playlist.userid=users.id and playlist.played is null
  and playlist.stream=1
order by playlist.added, playlist.id;
NOTICE:  QUERY PLAN:

Sort  (cost=10845.28..10845.28 rows=1857 width=80)
  ->  Hash Join  (cost=10007.93..10744.44 rows=1857 width=80)
        ->  Merge Join  (cost=10006.44..10668.17 rows=1857 width=64)
              ->  Sort  (cost=3754.03..3754.03 rows=2321 width=36)
                    ->  Hash Join  (cost=306.62..3624.28 rows=2321 width=36)
                          ->  Seq Scan on files  (cost=0.00..1072.13 rows=55413 width=16)
                          ->  Hash  (cost=300.81..300.81 rows=2321 width=20)
                                ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)
              ->  Sort  (cost=6252.41..6252.41 rows=50617 width=28)
                    ->  Seq Scan on paths  (cost=0.00..1677.17 rows=50617 width=28)
        ->  Hash  (cost=1.39..1.39 rows=39 width=16)
              ->  Seq Scan on users  (cost=0.00..1.39 rows=39 width=16)

EXPLAIN
music=# explain
select users.name as username, playlist.id as id, paths.len as len,
       paths.path as path, timestamp(playlist.added) as added,
       timestamp(playlist.played) as played
from users, playlist, paths, files
where paths.md5=files.md5 and files.song=playlist.songid
  and playlist.userid=users.id and playlist.played is null
  and playlist.stream=2
order by playlist.added, playlist.id;
NOTICE:  QUERY PLAN:

Sort  (cost=1467.44..1467.44 rows=164 width=80)
  ->  Hash Join  (cost=1.49..1461.40 rows=164 width=80)
        ->  Nested Loop  (cost=0.00..1452.85 rows=164 width=64)
              ->  Nested Loop  (cost=0.00..771.76 rows=205 width=36)
                    ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)
                    ->  Index Scan using files_song_btree_key on files  (cost=0.00..2.28 rows=1 width=16)
              ->  Index Scan using paths_md5_btree_key on paths  (cost=0.00..3.31 rows=1 width=28)
        ->  Hash  (cost=1.39..1.39 rows=39 width=16)
              ->  Seq Scan on users  (cost=0.00..1.39 rows=39 width=16)

EXPLAIN

Okay, so now the relevant schema:

            Table "users"
 Attribute |  Type   |                    Modifier
-----------+---------+------------------------------------------------
 id        | integer | not null default nextval('users_id_seq'::text)
 name      | text    | not null
Indices: users_name_key,
         users_pkey

Index "users_name_key"
 Attribute | Type
-----------+------
 name      | text
unique btree

Index "users_pkey"
 Attribute |  Type
-----------+---------
 id        | integer
unique btree (primary key)

              Table "playlist"
 Attribute |  Type   |                      Modifier
-----------+---------+---------------------------------------------------
 id        | integer | not null default nextval('playlist_id_seq'::text)
 songid    | integer |
 userid    | integer |
 added     | integer |
 played    | integer |
 stream    | integer |
Indices: playlist_added_key,
         playlist_pkey,
         playlist_played_key,
         playlist_songid_key

Index "playlist_added_key"
 Attribute |  Type
-----------+---------
 added     | integer
btree

Index "playlist_pkey"
 Attribute |  Type
-----------+---------
 id        | integer
unique btree (primary key)

Index "playlist_played_key"
 Attribute |  Type
-----------+---------
 played    | integer
btree

Index "playlist_songid_key"
 Attribute |  Type
-----------+---------
 songid    | integer
btree

       Table "paths"
 Attribute |   Type   | Modifier
-----------+----------+----------
 md5       | char(32) | not null
 path      | text     | not null
 len       | integer  |
Indices: paths_md5_btree_key,
         paths_md5_key,
         paths_path_key

Index "paths_md5_btree_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
btree

Index "paths_md5_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
hash

Index "paths_path_key"
 Attribute | Type
-----------+------
 path      | text
btree

       Table "files"
 Attribute |   Type   | Modifier
-----------+----------+----------
 song      | integer  | not null
 md5       | char(32) | not null
Indices: files_md5_btree_key,
         files_md5_key,
         files_song_btree_key,
         files_song_key

Index "files_md5_btree_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
btree

Index "files_md5_key"
 Attribute |   Type
-----------+----------
 md5       | char(32)
hash

Index "files_song_btree_key"
 Attribute |  Type
-----------+---------
 song      | integer
btree

Index "files_song_key"
 Attribute |  Type
-----------+---------
 song      | integer
hash

Interestingly, the sequential scan on playlist claims to be returning
2000 results for stream=1 and only 200 for stream=2. I'm not sure which
part of the where clause this guess comes from, because the playlist
table has roughly equal numbers of entries for both streams.
Has vacuum analyze been run on both recently? What is the maximum number
of rows with a particular stream value, and how many does each of 1 and 2
actually have?

> Interestingly, the sequential scan on playlist claims to be returning 2000
> results for stream=1 and only 200 for stream=2. I'm not sure which part of
> the where clause this guess comes from, because the playlist table has
> roughly equal numbers of entries for both streams.
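P.S. If in doubt, it's cheap to redo by hand before comparing the plans,
e.g.:

  vacuum analyze playlist;
  vacuum analyze files;
  vacuum analyze paths;
  vacuum analyze users;

(VACUUM ANALYZE is what refreshes the per-column statistics that the
planner bases these row estimates on.)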
Every night. There are 6223 rows with stream=2 and 7041 rows with
stream=1. At any given time, there will be between 1 and 30 rows with
played=null for both values.

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> Has vacuum analyze been run on both recently? What is the maximum number
> of rows with a particular stream value, and how many does each of 1 and 2
> actually have?
>
> > Interestingly, the sequential scan on playlist claims to be returning 2000
> > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > the where clause this guess comes from, because the playlist table has
> > roughly equal numbers of entries for both streams.
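P.S. Those numbers come from nothing fancier than plain counts, e.g.:

  select stream, count(*) from playlist group by stream;

  select stream, count(*)
  from playlist
  where played is null
  group by stream;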
Hmm, I also notice that it's getting very different row estimates for
files as well.

On Fri, 9 Mar 2001, Ben wrote:

> Every night. There are 6223 rows with stream=2 and 7041 rows with
> stream=1. At any given time, there will be between 1 and 30 rows with
> played=null for both values.
>
> On Fri, 9 Mar 2001, Stephan Szabo wrote:
>
> > Has vacuum analyze been run on both recently? What is the maximum number
> > of rows with a particular stream value, and how many does each of 1 and 2
> > actually have?
> >
> > > Interestingly, the sequential scan on playlist claims to be returning 2000
> > > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > > the where clause this guess comes from, because the playlist table has
> > > roughly equal numbers of entries for both streams.
I think that's because the stream=1 query is doing a sequential scan on
files, so it has to look at everything, whereas the index scan in the
stream=2 query only looks at the rows that matter. (Which in this case
was just 1 row.) I think the real question is: why is one using an index
scan and the other a sequential scan?

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> Hmm, I also notice that it's getting very different row estimates for
> files as well.
>
> On Fri, 9 Mar 2001, Ben wrote:
>
> > Every night. There are 6223 rows with stream=2 and 7041 rows with
> > stream=1. At any given time, there will be between 1 and 30 rows with
> > played=null for both values.
> >
> > On Fri, 9 Mar 2001, Stephan Szabo wrote:
> >
> > > Has vacuum analyze been run on both recently? What is the maximum number
> > > of rows with a particular stream value, and how many does each of 1 and 2
> > > actually have?
> > >
> > > > Interestingly, the sequential scan on playlist claims to be returning 2000
> > > > results for stream=1 and only 200 for stream=2. I'm not sure which part of
> > > > the where clause this guess comes from, because the playlist table has
> > > > roughly equal numbers of entries for both streams.
Ben <bench@silentmedia.com> writes:
> music=# explain ... where ... playlist.stream=1
>         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)

> music=# explain ... where ... playlist.stream=2
>         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)

I am betting that 1 is the most common value in playlist.stream, or at
least is being chosen as the most common value by VACUUM ANALYZE's
not-totally-accurate estimation process. The 2321 rowcount estimate then
falls out of the stored statistic for the most common value's frequency.
In the second case, the estimator knows that 2 is *not* the most common
value, but it has absolutely no statistical basis on which to guess what
the frequency really is. I think it uses 1/10th of the most common
frequency for anything that's not the most common value (look in
utils/adt/selfuncs.c to be sure). There's probably also some contribution
from the "playlist.played is null" clause, else the row count estimate
would be exactly 1/10th as much. However, I don't believe that the thing
currently makes any serious effort to gauge the selectivity of IS NULL,
which is a shame because that would critically affect the results here.
(You did say some thousands of rows matching the stream=N clause, but
only a few matching IS NULL, right?)

Given the fundamental difference in this initial row count estimate, the
large difference in the subsequent join plan structure is not too
surprising.

In short: another demonstration of the limitations of our current
statistics about data frequencies.

BTW, you didn't actually say which plan was faster. Since the second one
was closer to the true statistic (only a few rows returned from
playlist), I'm hoping it was faster...

			regards, tom lane
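P.S. If you want to see the stored guess for yourself, it lives in
pg_statistic. Something along these lines should work against the
7.0-era catalogs; I'm writing the column names from memory, so treat it
as a sketch:

  select a.attname, s.stacommonval, s.stacommonfrac
  from pg_statistic s, pg_class c, pg_attribute a
  where c.relname = 'playlist'      -- the table of interest
    and s.starelid = c.oid          -- statistics rows for that table
    and a.attrelid = c.oid
    and a.attnum = s.staattnum      -- match statistics to the column
    and a.attname = 'stream';       -- the column of interest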
On Fri, 9 Mar 2001, Tom Lane wrote:

> Ben <bench@silentmedia.com> writes:
> > music=# explain ... where ... playlist.stream=1
> >         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=2321 width=20)
>
> > music=# explain ... where ... playlist.stream=2
> >         ->  Seq Scan on playlist  (cost=0.00..300.81 rows=205 width=20)
>
> I am betting that 1 is the most common value in playlist.stream, or at
> least is being chosen as the most common value by VACUUM ANALYZE's
> not-totally-accurate estimation process. The 2321 rowcount estimate
> then falls out of the stored statistic for the most common value's
> frequency. In the second case, the estimator knows that 2 is *not* the

Yes, stream=1 is the most common value, but 2321? There are ~7000 rows
where stream=1.

> most common value, but it has absolutely no statistical basis on which
> to guess what the frequency really is. I think it uses 1/10th of the
> most common frequency for anything that's not the most common value
> (look in utils/adt/selfuncs.c to be sure). There's probably also some
> contribution from the "playlist.played is null" clause, else the row
> count estimate would be exactly 1/10th as much. However, I don't
> believe that the thing currently makes any serious effort to gauge the
> selectivity of IS NULL, which is a shame because that would critically
> affect the results here. (You did say some thousands of rows matching
> the stream=N clause, but only a few matching IS NULL, right?)

This is not good news. Most of the reason I have an index on
playlist.played is to figure out which rows are null, to limit the
results this query has to look at. Is this (probably very hard to fix)
shortcoming addressed in version 7.1, by chance?

> Given the fundamental difference in this initial row count estimate,
> the large difference in the subsequent join plan structure is not too
> surprising.

This I don't follow. The stream=1 query is returning 10 times the rows -
shouldn't that make the use of an index even *more* likely?

> BTW, you didn't actually say which plan was faster. Since the second
> one was closer to the true statistic (only a few rows returned from
> playlist), I'm hoping it was faster...

Sorry. Yes, the second one was far faster, thanks to the indices it used.
> This I don't follow. The stream=1 query is returning 10 times the rows -
> shouldn't that make the use of an index even *more* likely?

No, because if you're getting a large % of the rows of the table, it'll
be faster to scan the heap sequentially than to visit it in random order
by following the index. (You have to go into the heap anyway to find out
whether or not each tuple is valid, and seeking all over the heap file
can end up being more expensive than reading it straight through.)

As a question, does an index on stream, played (both in the same index)
help at all?
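Concretely, something like this (the index name here is only
illustrative):

  -- one composite btree covering both restriction columns
  create index playlist_stream_played_key on playlist (stream, played);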
Ben <bench@silentmedia.com> writes:
>> I am betting that 1 is the most common value in playlist.stream, or at
>> least is being chosen as the most common value by VACUUM ANALYZE's
>> not-totally-accurate estimation process. The 2321 rowcount estimate
>> then falls out of the stored statistic for the most common value's
>> frequency. In the second case, the estimator knows that 2 is *not* the

> Yes, stream=1 is the most common value, but 2321? There are ~7000 rows
> where stream=1.

Well, I did say not totally accurate ;-). But [digs in code] there is a
hardwired selectivity of 0.333 for the IS NULL clause, so it's really
estimating about 6960 rows matching the stream=1 clause (2321 / 0.333 is
roughly 6970). That's closer than I'd have expected, actually, given the
inaccuracies in VACUUM's method of obtaining this estimate. So the real
problem here is the lack of estimation for IS NULL.

> This is not good news. Most of the reason I have an index on
> playlist.played is to figure out which rows are null, to limit the
> results this query has to look at.

We don't currently use indexes for IS NULL at all.

> Is this (probably very hard to fix) shortcoming addressed in version
> 7.1, by chance?

No. The fact that the planner doesn't derive an estimate could be fixed
with not too much work (but it won't happen for 7.1). Making IS NULL an
indexable condition would affect considerably more code, and it's not
really what's making the difference here anyway --- although it would
allow us to replace the seqscan on playlist with an indexscan, which'd
be even faster.

> This I don't follow. The stream=1 query is returning 10 times the rows -
> shouldn't that make the use of an index even *more* likely?

No. Your second query is using nested loop with inner indexscan joins.
That's fine for *small* numbers of rows returned by the outer subquery,
but it goes to hell in a handbasket for large numbers of rows. The
planner is doing the right thing to switch to a heavier-duty plan with
more startup overhead ... or it would be if it had the right idea about
how many rows are going to come out of playlist, that is.

The real issue here is the lack of estimation of IS NULL selectivity.
I was aware that that was broken, but I don't recall having seen a
real-life example before where it really mattered. I'll try to move it
up a notch or two on my to-do list. Or if someone else wants to dive in,
feel free...

			regards, tom lane
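P.S. You can read the tradeoff straight off the plan numbers. In the
stream=2 plan, the nested loop over files costs roughly the outer scan
plus one index probe per estimated outer row:

  outer seq scan on playlist              300.81
  205 inner probes at ~2.28 apiece        467.40
  ----------------------------------------------
  total                                   768.21   (plan says 771.76)

Multiply the outer row count by ten and the per-probe charges swamp the
plan; at that point the sort/merge/hash strategy of the first plan is
the cheaper bet.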
Unfortunately, no.

On Fri, 9 Mar 2001, Stephan Szabo wrote:

> As a question, does an index on stream, played (both in the same
> index) help at all?
On Fri, 9 Mar 2001, Tom Lane wrote:

> No. The fact that the planner doesn't derive an estimate could be fixed
> with not too much work (but it won't happen for 7.1). Making IS NULL an
> indexable condition would affect considerably more code, and it's not
> really what's making the difference here anyway --- although it would
> allow us to replace the seqscan on playlist with an indexscan, which'd
> be even faster.

It would be nice if postgres could index on IS NULL, or if it were
documented that indexes don't get used there. Had I known, I would have
used played = 0 to mean the same thing that I'm trying to get out of
played IS NULL... doing that lets explain give a very accurate picture.

> No. Your second query is using nested loop with inner indexscan joins.
> That's fine for *small* numbers of rows returned by the outer subquery,
> but it goes to hell in a handbasket for large numbers of rows. The
> planner is doing the right thing to switch to a heavier-duty plan with
> more startup overhead ... or it would be if it had the right idea about
> how many rows are going to come out of playlist, that is.

Ahhh, so desu. Thanks for the explanations.

> The real issue here is the lack of estimation of IS NULL selectivity.
> I was aware that that was broken, but I don't recall having seen a
> real-life example before where it really mattered. I'll try to move
> it up a notch or two on my to-do list. Or if someone else wants to
> dive in, feel free...

Yeah, like I said above, being able to use an accurate index for
estimating the size of the inner loop makes quite a difference. But at
least in this case, I had a value I could use in place of NULL, so it's
hardly a high priority. Thanks again.
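P.S. For the archives, the workaround looks like this (an untested
sketch; 0 is just a sentinel meaning "not yet played"):

  -- backfill the sentinel so the equality test finds today's rows
  update playlist set played = 0 where played is null;

  select users.name as username, playlist.id as id, paths.len as len,
         paths.path as path, timestamp(playlist.added) as added,
         timestamp(playlist.played) as played
  from users, playlist, paths, files
  where paths.md5 = files.md5
    and files.song = playlist.songid
    and playlist.userid = users.id
    and playlist.played = 0        -- was: playlist.played is null
    and playlist.stream = 2
  order by playlist.added, playlist.id;

With an ordinary equality on played, the existing playlist_played_key
btree index becomes usable, and the planner can estimate the clause's
selectivity like any other.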