Home > mailing lists

Re: Estimating hot data size - Mailing list pgsql-performance

From	Tomas Vondra
Subject	Re: Estimating hot data size
Date	February 16, 2011 17:14:58
Msg-id	4D5C3E05.2070909@fuzzy.cz Whole thread Raw
In response to	Estimating hot data size (Chris Hoover <revoohc@gmail.com>)
List	pgsql-performance

Tree view

Dne 16.2.2011 21:51, Chris Hoover napsal(a):
> All,
>
> I'm trying to estimate the size of my hot data set, and wanted to get
> some validation that I'm doing this correctly.
>
> Basically, I'm using the sum(heap_blks_read + idx_blks_read) from
> pg_statio_all_tables, and diffing the numbers over a period of time (1
> hour at least).  Is this a fair estimate?  The reason for doing this is
> we are looking at new server hardware, and I want to try and get enough
> ram on the machine to keep the hot data in memory plus provide room for
> growth.
>
> Thanks,
>
> Chris
>
> Example:
>
>
>
> *Time*
>
>
>
> *Total Blocks*
>
> 2011-02-16 11:25:34.621874-05
>
>
>
> 123,260,464,427.00
>
> 2011-02-16 12:25:46.486719-05
>
>
>
> 123,325,880,943.00
>
>
>
> To get the hot data for this hour (in KB), I'm taking:
>
>
>  (123,325,880,943.00 - 123,260,464,427.00)* 8 = 523,332,128KB
>
>
> Correct?

I doubt that, although I'm not sure what exactly you mean by hot data
set. I guess it's the data set you're working with frequently, right?

The first gotcha is that heap_blks_read counts only blocks not found in
shared buffers, so those 500MB is actually the amount of data read from
the disk (or filesystem cache). It does not say anything about how
frequently the data are used.

The second gotcha is that the same block may be counted repeatedly,
especially if it is not frequently used. It's counted for query A, then
it's removed from the cache (to be replaced by another block), and then
for another query B. So the number heap_blks_read does not mean there
were that many different blocks read from the disk.

What I'd recommend is to measure the cache hit ratio, i.e. this

   heap_blks_hit / (heap_blks_read + heap_blks_hit)

which means how efficient the cache is. Increase shared buffers until it
stops to increase - that's the hot data set size.

regards
Tomas

PS: The value heap_blks_hit does not actually mean the blocks were read
    from the disk - it might be read from filesystem cache (and there's
    not easy way to find out this AFAIK).

pgsql-performance by date:

From: Chris Hoover
Date: 16 February 2011, 16:51:45
Subject: Estimating hot data size

From: Bruce Momjian
Date: 16 February 2011, 17:22:54
Subject: Re: Why we don't want hints Was: Slow count(*) again...

Re: Estimating hot data size - Mailing list pgsql-performance

Previous

Next