Thread: Query only slow on first run

Query only slow on first run

From: cluster
I have a query that takes about 7000 ms on average to complete the first
time it runs. Subsequent runs complete in only 50 ms. That is more than
a factor of 100 faster! How can I make the query perform well on the first
run too?

The query and the output of EXPLAIN ANALYZE from both the first and second
runs are pasted here:

http://rafb.net/p/yrKyoA17.html

Re: Query only slow on first run

From: Andrew Sullivan
On Tue, Nov 27, 2007 at 05:33:36PM +0100, cluster wrote:
> I have a query that takes about 7000 ms on average to complete the first
> time it runs. Subsequent runs complete in only 50 ms. That is more than
> a factor of 100 faster! How can I make the query perform well on the first
> run too?

Probably by buying much faster disk hardware.  You'll note that the query
plans you posted are the same, except for the actual time it took to get the
results back.  That tells me you have slow storage.  On subsequent runs,
the data is cached, so it's fast.

A

--
Andrew Sullivan
Old sigs will return after re-constitution of blue smoke

Re: Query only slow on first run

From: Kevin Kempter
On Tuesday 27 November 2007 09:33:36 cluster wrote:
> I have a query that takes about 7000 ms on average to complete the first
> time it runs. Subsequent runs complete in only 50 ms. That is more than
> a factor of 100 faster! How can I make the query perform well on the first
> run too?
>
> The query and the output of EXPLAIN ANALYZE from both the first and second
> runs are pasted here:
>
> http://rafb.net/p/yrKyoA17.html


The query is faster the second run because the data pages were pulled into the
buffer pool during the first run. I would suggest using the explain plan from
the first run and testing your changes on a recently started instance (or at
least on an instance where enough activity has transpired to effectively
rotate the buffer pages).

/Kevin

Re: Query only slow on first run

From: Tom Lane
Andrew Sullivan <ajs@crankycanuck.ca> writes:
> On Tue, Nov 27, 2007 at 05:33:36PM +0100, cluster wrote:
>> I have a query that takes about 7000 ms on average to complete the first
>> time it runs. Subsequent runs complete in only 50 ms. That is more than
>> a factor of 100 faster! How can I make the query perform well on the first
>> run too?

> Probably by buying much faster disk hardware.

Or buy more RAM, so that the data can stay cached.

            regards, tom lane

Re: Query only slow on first run

From: cluster
>> Probably by buying much faster disk hardware.
> Or buy more RAM, so that the data can stay cached.

So the only problem here is lack of RAM and/or disk speed?

Re: Query only slow on first run

From: Bill Moran
In response to cluster <skrald@amossen.dk>:

> >> Probably by buying much faster disk hardware.
> > Or buy more RAM, so that the data can stay cached.
>
> So the only problem here is lack of RAM and/or disk speed?

Not automatically, but the chances that more RAM and/or faster disks will
improve this situation are probably 90% or better.

Other things that could cause this problem are poor schema design, and
unreasonable expectations.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: Query only slow on first run

From: "Dave Dutcher"
> -----Original Message-----
> From: cluster
>
> >> Probably by buying much faster disk hardware.
> > Or buy more RAM, so that the data can stay cached.
>
> So the only problem here is lack of RAM and/or disk speed?

I don't think you can reach that conclusion yet.  As everybody said, the
reason the query was faster the second time is that the disk pages were
cached in RAM, and pulling the data out of RAM is way faster than disk.  If
I were you, I would try to optimize the query for when the disk pages aren't
in RAM.  In order to test the query without anything cached, you need to
clear out Postgres's shared buffers and the OS cache.  That can be
tricky, but it may be as easy as running a big select on another table.

As for optimizing the query, I noticed that all three joins are done by
nested loops.  I wonder if another join method would be faster.  Have you
analyzed all the tables?  You aren't disabling hash joins or merge joins, are
you?  If you aren't, then as a test I would try disabling nested loops by
doing "set enable_nestloop=false" and see if the query is any faster for
you.  If it is faster without nested loops, then you might need to look into
changing some settings.
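
For example, such a test could look like this (a sketch; the SELECT stands
in for the actual query from the paste link):

```sql
-- Disable nested loop joins for this session only, as an experiment.
SET enable_nestloop = false;

-- Re-run the query under EXPLAIN ANALYZE and compare the plan and timings.
EXPLAIN ANALYZE
SELECT ... ;  -- the original query goes here

-- Restore the default setting afterwards.
RESET enable_nestloop;
```

SET only affects the current session, so this is safe to try on a live
system.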

Dave


Re: Query only slow on first run

From: cluster
> As for optimizing the query, I noticed that all three joins are done by
> nested loops.  I wonder if another join method would be faster.  Have you
> analyzed all the tables?

Yes. I did a VACUUM FULL ANALYZE before running the test queries. I have
also just performed another ANALYZE to be sure everything was really
analyzed.

> You aren't disabling hash joins or merge joins are
> you?

Nope.

>  If you aren't, then as a test I would try disabling nested loops by
> doing "set enable_nestloop=false" and see if the query is any faster for
> you.

If I disable the nested loops, the query becomes *much* slower.

A thing that strikes me is the following. As you can see, I have the
constraint q.status = 1. Only a small subset of the data set has this
status. I have an index on q.status but for some reason it is not
used. Instead, the constraint is enforced with a "Filter: (q.status = 1)"
in an index scan for the primary key in the "q" table. If the small
subset having q.status = 1 could be isolated quickly using an index, I
would expect the query to perform better. I just don't know why the
planner doesn't use the index on q.status.

Re: Query only slow on first run

From: "Steinar H. Gunderson"
On Tue, Nov 27, 2007 at 11:51:40PM +0100, cluster wrote:
> A thing that strikes me is the following. As you can see I have the
> constraint: q.status = 1. Only a small subset of the data set has this
> status. I have an index on q.status but for some reason it is not used.
> Instead, the constraint is enforced with a "Filter: (q.status = 1)" in an
> index scan for the primary key in the "q" table. If the small subset having
> q.status = 1 could be isolated quickly using an index, I would expect the
> query to perform better. I just don't know why the planner doesn't use the
> index on q.status.

An index scan (as opposed to a bitmap index scan) can only use one index at a
time, so it will choose the most selective one. Here it quite correctly
recognizes that there will only be one matching record for the given
question_id, so it uses the primary key instead.

You could make an index on (question_id, status) (or a partial index on
question_id, with status = 1 as the filter), but I'm not sure how much it would
help you unless the questions table is extremely big. It doesn't appear to
be; in fact, it appears to be all in RAM, so that's not your bottleneck.
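
Sketches of both variants, assuming the table and column names used in this
thread:

```sql
-- Combined index: question_id plus status in a single index.
CREATE INDEX questions_qid_status_idx ON questions (question_id, status);

-- Alternatively, a partial index containing only the status = 1 rows.
CREATE INDEX questions_qid_active_idx ON questions (question_id)
    WHERE status = 1;
```

The partial index stays small because it only indexes the active subset.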

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Query only slow on first run

From: "Dave Dutcher"
> -----Original Message-----
> From: cluster
>
> If I disable the nested loops, the query becomes *much* slower.
>
> A thing that strikes me is the following. As you can see I have the
> constraint: q.status = 1. Only a small subset of the data set
> has this status. I have an index on q.status but for some
> reason it is not used. Instead, the constraint is enforced
> with a "Filter: (q.status = 1)"
> in an index scan for the primary key in the "q" table. If the
> small subset having q.status = 1 could be isolated quickly
> using an index, I would expect the query to perform better. I
> just don't know why the planner doesn't use the index on q.status.
>

What version of Postgres are you using?  Do you know what your
join_collapse_limit is set to?

You might be able to force it to scan for questions with a status of 1 first
to see if it helps by changing the FROM clause to:

FROM posts p, question_tags qt, (SELECT * FROM questions WHERE status = 1
OFFSET 0) q

Dave



Re: Query only slow on first run

From: Tom Lane
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> You could make an index on (question_id,status) (or a partial index on
> question id, with status=1 as the filter), but I'm not sure how much it would
> help you unless the questions table is extremely big. It doesn't appear to
> be; in fact, it appears to be all in RAM, so that's not your bottleneck.

Wouldn't help, because the accesses to "questions" are not the problem.
The query's spending nearly all its time in the scan of "posts", and
I'm wondering why --- doesn't seem like it should take 6400msec to fetch
646 rows, unless perhaps the data is just horribly misordered relative
to the index.  Which may in fact be the case ... what exactly is that
"random_number" column, and why are you desirous of ordering by it?
For that matter, if it is what it sounds like, why is it sane to group
by it?  You'll probably always get groups of one row ...

            regards, tom lane

Re: Query only slow on first run

From: "Steinar H. Gunderson"
On Tue, Nov 27, 2007 at 07:25:54PM -0500, Tom Lane wrote:
>> You could make an index on (question_id,status) (or a partial index on
>> question id, with status=1 as the filter), but I'm not sure how much it would
>> help you unless the questions table is extremely big. It doesn't appear to
>> be; in fact, it appears to be all in RAM, so that's not your bottleneck.
> Wouldn't help, because the accesses to "questions" are not the problem.

Yes, that was my point too. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Query only slow on first run

From: tmp
> The query's spending nearly all its time in the scan of "posts", and
> I'm wondering why --- doesn't seem like it should take 6400msec to fetch
> 646 rows, unless perhaps the data is just horribly misordered relative
> to the index.   Which may in fact be the case ...

Yes, they probably are. I use the random_number column to retrieve a
semi-random sample subset from the large number of rows. The technique is
described in [1]. This subset is later used for some statistical
investigation, but that is somewhat irrelevant here. To retrieve the
sample quickly, I have made an index on the random_number column.
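
Roughly, the sample is retrieved like this (a simplified sketch of the
technique in [1]; K = 200 is just an example):

```sql
-- Jump to a random point in the indexed random_number ordering and
-- take the next K rows from there, giving a semi-random sample.
SELECT *
FROM posts
WHERE random_number > random()   -- random() returns a float in [0, 1)
ORDER BY random_number
LIMIT 200;                       -- K = 200
```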

> what exactly is that
> "random_number" column

A random float that is initialized when the row is created and never
modified afterwards. The physical row ordering will clearly not match
the random_number ordering. However, other queries use a row ordering
by the primary key, so I don't think it would make much sense to make the
index on random_number a clustering index just to speed up this
single query.

>  and why are you desirous of ordering by it?

In order to simulate a random pick of K rows. See [1].

> For that matter, if it is what it sounds like, why is it sane to group
> by it?  You'll probably always get groups of one row ...

For each random_number, another table (question_tags) holds zero or more
rows satisfying a number of constraints. I need to count(*) the number
of corresponding question_tag rows for each random_number.

We have primarily two tables of interest here: questions (~100k rows)
and posts (~400k rows). Each post refers to a question, but only the
"posts" rows for which the corresponding "question.status = 1" are
relevant. This reduces the number of relevant question rows to about
10k. Within the post rows corresponding to these 10k questions I would
like to pick a random sample of size K.

[1] http://archives.postgresql.org/pgsql-general/2007-10/msg01240.php


Re: Query only slow on first run

From: Craig James
tmp wrote:
>> what exactly is that
>> "random_number" column
>
> A random float that is initialized when the row is created and never
> modified afterwards. The physical row ordering will clearly not match
> the random_number ordering. However, other queries use a row ordering
> by the primary key so I don't think it would make much sense to make the
> index on random_number a clustering index just in order to speed up this
> single query.
>
>>  and why are you desirous of ordering by it?
>
> In order to simulate a random pick of K rows. See [1].

A trick that I used is to sample the random column once, and create a much
smaller table of the first N rows, where N is the sample size you want, and
use that.

If you need a different N samples each time, you can create a temporary
table, put your random N rows into that, do an ANALYZE, and then join to
this smaller table.  The overall performance can be MUCH faster even though
you're creating and populating a whole table, than the plan that Postgres
comes up with. This seems wrong-headed (why shouldn't Postgres be able to
be as efficient on its own?), but it works.
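
In sketch form (table and column names assumed from this thread, N = 200):

```sql
-- Materialize a small random sample once, instead of re-sampling
-- the big table inside the main query.
CREATE TEMPORARY TABLE posts_sample AS
    SELECT *
    FROM posts
    WHERE random_number > random()
    ORDER BY random_number
    LIMIT 200;                -- N = 200

-- Give the planner statistics about the small table.
ANALYZE posts_sample;

-- Then join the rest of the query against posts_sample instead of posts.
```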

Craig


Re: Query only slow on first run

From: "Dave Dutcher"
> -----Original Message-----
> From: tmp
> We have primarily two tables of interest here: questions
> (~100k rows) and posts (~400k rows). Each post refers to a
> question, but only the "posts" rows for which the
> corresponding "question.status = 1" are relevant. This
> reduces the number of relevant question rows to about 10k.

Earlier you said only a small subset of questions have a status of 1, so I
assumed you meant like 100 not 10k :)  According to the explain analyze
there are only 646 rows in posts which match your criteria, so it does seem
like scanning posts first might be the right thing to do.


Re: Query only slow on first run

From: Tom Lane
"Dave Dutcher" <dave@tridecap.com> writes:
> ... According to the explain analyze
> there are only 646 rows in posts which match your criteria, so it does seem
> like scanning posts first might be the right thing to do.

No, that's not right.  What the output actually shows is that only 646
posts rows were needed to produce the first 200 aggregate rows, which was
enough to satisfy the LIMIT.  The planner is evidently doing things this
way in order to exploit the presence of the LIMIT --- if it had to
compute all the aggregate results it would likely have picked a
different plan.

            regards, tom lane

Re: Query only slow on first run

From: cluster
>> I'm wondering why --- doesn't seem like it should take 6400msec to fetch
>> 646 rows, unless perhaps the data is just horribly misordered relative
>> to the index.   Which may in fact be the case ...

Hmm, actually I still don't understand why it takes 6400 ms to fetch the
rows. As far as I can see, the index used is "covering", so real row
lookups shouldn't be necessary. Also, only the random_numbers induced
by questions with status = 1 should be considered - and this part is a
relatively small subset.

In general, I don't understand why the query is as I/O dependent as it
apparently is.

Re: Query only slow on first run

From: "Steinar H. Gunderson"
On Wed, Nov 28, 2007 at 09:16:08PM +0100, cluster wrote:
> Hmm, actually I still don't understand why it takes 6400 ms to fetch the
> rows. As far as I can see the index used is "covering" so that real row
> lookups shouldn't be necessary.

The indexes don't contain visibility information, so Postgres has to look up
the row on disk to verify it isn't dead.

> Also, only the random_numbers induced by questions with status = 1
> should be considered - and this part is a relatively small subset.

Again, you'll need to have a combined index if you want this to help you any.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Query only slow on first run

From: cluster
> The indexes don't contain visibility information, so Postgres has to look up
> the row on disk to verify it isn't dead.

I guess this fact drastically decreases the performance. :-(
The number of rows with a random_number will just grow over time, while
the number of questions with status = 1 will always be somewhat constant
at about 10,000 or most likely much less.

I could really use any kind of suggestion on how to improve the query in
order to make it scale better for large data sets. The 6000-7000 ms for a
clean run is really a showstopper. I need to get it below 70 ms somehow.

Re: Query only slow on first run

From: Jean-David Beyer
cluster wrote:
>> The indexes don't contain visibility information, so Postgres has to
>> look up the row on disk to verify it isn't dead.
>
> I guess this fact drastically decreases the performance. :-( The number
> of rows with a random_number will just grow over time, while the number of
> questions with status = 1 will always be somewhat constant at about
> 10,000 or most likely much less.
>
> I could really use any kind of suggestion on how to improve the query in
> order to make it scale better for large data sets. The 6000-7000 ms for a
> clean run is really a showstopper. I need to get it below 70 ms somehow.
>
>
Here is a suggestion that I have not tried. This might not make sense,
depending on how often you do this.

Make two tables whose DDL is almost the same. In one, put all the rows with
status = 1, and in the other put all the rows whose status != 1.

Now all the other queries you run would probably need to join both tables,
so maybe you make a hash index on the right fields so that would go fast.

Now for the status = 1 queries, you just look at that smaller table. This
would obviously be faster.

For the other queries, you would get stuck with the join. You would have to
weigh the overall performance issue vs. the performance of this special query.
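
Something along these lines (an untested sketch, with the table names made
up):

```sql
-- Split the rows by status into two structurally identical tables.
CREATE TABLE questions_active   (LIKE questions);
CREATE TABLE questions_inactive (LIKE questions);

INSERT INTO questions_active   SELECT * FROM questions WHERE status = 1;
INSERT INTO questions_inactive SELECT * FROM questions WHERE status <> 1;

-- Queries that need every row can go through a UNION ALL view.
CREATE VIEW questions_all AS
    SELECT * FROM questions_active
    UNION ALL
    SELECT * FROM questions_inactive;
```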


--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 16:55:01 up 2 days, 22:43, 0 users, load average: 4.31, 4.32, 4.20

Re: Query only slow on first run

From: Tom Lane
cluster <skrald@amossen.dk> writes:
> I could really use any kind of suggestion on how to improve the query in
> order to make it scale better for large data sets. The 6000-7000 ms for a
> clean run is really a showstopper. I need to get it below 70 ms somehow.

Buy a faster disk?

You're essentially asking for a random sample of data that is not
currently in memory.  You're not going to get that without some I/O.

            regards, tom lane

Re: Query only slow on first run

From: "Scott Marlowe"
On Nov 28, 2007 3:15 PM, cluster <skrald@amossen.dk> wrote:
> > The indexes don't contain visibility information, so Postgres has to look up
> > the row on disk to verify it isn't dead.
>
> I guess this fact drastically decreases the performance. :-(
> The number of rows with a random_number will just grow over time while
> the number of questions with status = 1 will always be somewhat constant
> at about 10,000 or most likely much less.

Have you tried a partial index?

create index xyz on tablename (random) where status = 1

> I could really use any kind of suggestion on how to improve the query in
> order to make it scale better for large data sets. The 6000-7000 ms for a
> clean run is really a showstopper. I need to get it below 70 ms somehow.

Also, look into clustering the table on status or random every so often.
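
For example, using the partial index from above (a sketch; newer versions
spell it CLUSTER tablename USING xyz):

```sql
-- Physically reorder the table along the partial index, then refresh stats.
CLUSTER xyz ON tablename;
ANALYZE tablename;
```

Note that CLUSTER takes an exclusive lock and is a one-time reordering, so
it has to be repeated as the data churns.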

More importantly, you might need to research a faster way to get your
random results.

Re: Query only slow on first run

From: cluster
> You're essentially asking for a random sample of data that is not
> currently in memory.  You're not going to get that without some I/O.

No, that sounds reasonable enough. But do you agree with the statement
that my query will just get slower and slower over time as the number of
posts increases while the part having status = 1 is constant?
(Therefore, as the relevant fraction becomes smaller over time, the
"Filter: status = 1" operation becomes slower.)

Re: Query only slow on first run

From: Tom Lane
cluster <skrald@amossen.dk> writes:
>> You're essentially asking for a random sample of data that is not
>> currently in memory.  You're not going to get that without some I/O.

> No, that sounds reasonable enough. But do you agree with the statement
> that my query will just get slower and slower over time as the number of
> posts increases while the part having status = 1 is constant?

No, not as long as it sticks to that plan.  The time is basically
determined by the number of aggregate rows the LIMIT asks for,
times the average number of "posts" rows per aggregate group.
And as you said, the latter number is not going to increase.

            regards, tom lane