Thread: How exactly does Analyze work?

How exactly does Analyze work?

From
Richard Neill
Date:
Dear All,

Thanks very much for your help so far. My understanding of PG is getting
a lot better!

I wonder if I've understood analyze properly: I'm not sure I quite
understand how specific the statistics gathered actually are.


In particular, what happens in the following case:
   1. I start with have a table with 100 million rows, and column wid has
      linearly distributed values from 45-90.  (wid is indexed)

   2. I run vacuum analyze

   3. I insert about 2 million rows, all of which have the new wid of 91.

   4. I then do a select * WHERE wid = 91.

How smart is analyze? Will it actually say "well, I've never seen 91 in
this table, because all the values only go up to 90, so you'd better do
a sequential scan"?


-----

On another note, I notice that if I ever manually run vacuum or analyze,
the performance of the database drops to the point where many of the
operators get kicked out. Is there any way to run them "nice" ?

We need to maintain a response time of under 1 second all day for simple
queries (which usually run in about 22ms). But Vacuum or Analyze seem to
lock up the system for a few minutes, during which other queries block
on them, although there is still plenty of CPU spare.

-----


Also, I find that, even with the autovacuum daemon running, there was
one query last night that I had to terminate after an hour. In
desperation, I restarted postgres, let it take 15 mins to vacuum the
entire DB, and then re-ran the query (in 8 minutes)

Any ideas how I can troubleshoot this better? The database is only 30GB
in total - it should (if my intuition is right) be impossible that any
simple select (even over a modestly complex view) should take longer
than a multiple of the time required to read all the data from disk?



Thanks very much,

Richard









Re: How exactly does Analyze work?

From
Tom Lane
Date:
Richard Neill <rn214@cam.ac.uk> writes:
> In particular, what happens in the following case:
>    1. I start with have a table with 100 million rows, and column wid has
>       linearly distributed values from 45-90.  (wid is indexed)

>    2. I run vacuum analyze

>    3. I insert about 2 million rows, all of which have the new wid of 91.

>    4. I then do a select * WHERE wid = 91.

> How smart is analyze? Will it actually say "well, I've never seen 91 in
> this table, because all the values only go up to 90, so you'd better do
> a sequential scan"?

ANALYZE is not magic.  The system won't know that the 91's are there
until you re-ANALYZE (either manually or automatically).  In a case
like this I expect the planner would assume there are very few matching
rows and go for an indexscan.  That might still be the right thing given
this specific scenario (need to fetch 2% of the table), but it certainly
wouldn't be if you had say half of the table matching the query.
Moral: re-ANALYZE after any bulk load.

> On another note, I notice that if I ever manually run vacuum or analyze,
> the performance of the database drops to the point where many of the
> operators get kicked out. Is there any way to run them "nice" ?

See vacuum_cost_delay.

> We need to maintain a response time of under 1 second all day for simple
> queries (which usually run in about 22ms). But Vacuum or Analyze seem to
> lock up the system for a few minutes, during which other queries block
> on them, although there is still plenty of CPU spare.

It sounds to me like you don't really have enough disk I/O bandwidth
to meet your performance requirements.  All the CPU in the world won't
help you if you didn't spend any money on the disks :-(.  You might be
able to alleviate this with vacuum_cost_delay, but it's a band-aid.

            regards, tom lane

Re: How exactly does Analyze work?

From
Kevin Kempter
Date:
On Wednesday 25 November 2009 05:34:26 Richard Neill wrote:
> Dear All,
>
> Thanks very much for your help so far. My understanding of PG is getting
> a lot better!
>
> I wonder if I've understood analyze properly: I'm not sure I quite
> understand how specific the statistics gathered actually are.
>
>
> In particular, what happens in the following case:
>    1. I start with have a table with 100 million rows, and column wid has
>       linearly distributed values from 45-90.  (wid is indexed)
>
>    2. I run vacuum analyze
>
>    3. I insert about 2 million rows, all of which have the new wid of 91.
>
>    4. I then do a select * WHERE wid = 91.
>
> How smart is analyze? Will it actually say "well, I've never seen 91 in
> this table, because all the values only go up to 90, so you'd better do
> a sequential scan"?
>
>
> -----
>
> On another note, I notice that if I ever manually run vacuum or analyze,
> the performance of the database drops to the point where many of the
> operators get kicked out. Is there any way to run them "nice" ?

increasing maintenance_work_mem to several GB (if you have the memory) will
help

>
> We need to maintain a response time of under 1 second all day for simple
> queries (which usually run in about 22ms). But Vacuum or Analyze seem to
> lock up the system for a few minutes, during which other queries block
> on them, although there is still plenty of CPU spare.
>
> -----
>
>
> Also, I find that, even with the autovacuum daemon running, there was
> one query last night that I had to terminate after an hour. In
> desperation, I restarted postgres, let it take 15 mins to vacuum the
> entire DB, and then re-ran the query (in 8 minutes)
>
> Any ideas how I can troubleshoot this better? The database is only 30GB
> in total - it should (if my intuition is right) be impossible that any
> simple select (even over a modestly complex view) should take longer
> than a multiple of the time required to read all the data from disk?
>
>
>
> Thanks very much,
>
> Richard
>