Thread: Blocks read for index scans
While working on determining a good stripe size for a database, I realized
it would be handy to know what the average request size is. Getting this
info is a simple matter of joining pg_stat_all_tables and
pg_statio_all_tables and doing some math, but there's one issue I've found:
it appears that there's no information on how many heap blocks were read in
by an index scan. Is there any way to get that info?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Adding -performance back in...

> From: Steve Poe [mailto:steve.poe@gmail.com]
> Jim,
>
> I could be way off, but doesn't pg_statio_user_tables contain this
> information?

http://www.postgresql.org/docs/8.1/interactive/monitoring-stats.html#MONITORING-STATS-VIEWS
states: "numbers of disk blocks read and buffer hits in all indexes of
that table". That leads me to believe that it's only tracking index blocks
read, and not heap blocks read. One could presume that each index row read
as reported by pg_stat_all_tables would represent a heap block read, but a
large number of those would (hopefully) already have been in
shared_buffers.

> On Thu, 2006-04-13 at 13:00 -0500, Jim Nasby wrote:
> > While working on determining a good stripe size for a database, I
> > realized it would be handy to know what the average request size is.
> > Getting this info is a simple matter of joining pg_stat_all_tables
> > and pg_statio_all_tables and doing some math, but there's one issue
> > I've found: it appears that there's no information on how many heap
> > blocks were read in by an index scan. Is there any way to get that info?
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Jim Nasby <jnasby@pervasive.com> writes:
> While working on determining a good stripe size for a database, I
> realized it would be handy to know what the average request size is.
> Getting this info is a simple matter of joining pg_stat_all_tables
> and pg_statio_all_tables and doing some math, but there's one issue
> I've found: it appears that there's no information on how many heap
> blocks were read in by an index scan. Is there any way to get that info?

If the table is otherwise idle, the change in the table's entry in
pg_statio_all_tables should do, no?

(This is as of 8.1 ... older versions acted a bit differently IIRC.)

			regards, tom lane
On Thu, Apr 13, 2006 at 08:36:09PM -0400, Tom Lane wrote:
> Jim Nasby <jnasby@pervasive.com> writes:
> > While working on determining a good stripe size for a database, I
> > realized it would be handy to know what the average request size is.
> > Getting this info is a simple matter of joining pg_stat_all_tables
> > and pg_statio_all_tables and doing some math, but there's one issue
> > I've found: it appears that there's no information on how many heap
> > blocks were read in by an index scan. Is there any way to get that info?
>
> If the table is otherwise idle, the change in the table's entry in
> pg_statio_all_tables should do, no?

Ahh, ok, I see the heap blocks are counted. So I guess if you wanted to
know the average number of blocks read from the heap per request, you'd
have to do heap_blks_read / (seq_scan + idx_scan), with the last two
coming from pg_stat_all_tables.

In my case it would be helpful to break the heap access numbers out
between seqscans and index scans, since each of those represents very
different access patterns. Would adding that be a mess?

> (This is as of 8.1 ... older versions acted a bit differently IIRC.)

Yeah; I recall that it was pretty confusing exactly how things were
broken out, and that you changed it as part of the bitmap scan work.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
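[Editor's note: the arithmetic above can be sketched as follows. This is an
illustration only, with made-up example numbers; the parameter names mirror
the counter columns from pg_statio_all_tables (heap_blks_read) and
pg_stat_all_tables (seq_scan, idx_scan).]

```python
def avg_heap_blks_per_scan(heap_blks_read, seq_scan, idx_scan):
    """Average heap blocks physically read per scan, combining a
    counter from pg_statio_all_tables (heap_blks_read) with scan
    counts from pg_stat_all_tables (seq_scan, idx_scan)."""
    scans = seq_scan + idx_scan
    if scans == 0:
        return 0.0
    return heap_blks_read / scans

# Example: 120,000 heap blocks read across 40 seqscans + 960 index scans.
print(avg_heap_blks_per_scan(120_000, 40, 960))  # → 120.0
```

Note this lumps seqscan and index-scan heap reads together, which is
exactly the limitation discussed in the rest of the thread.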
Jim Nasby wrote:
> While working on determining a good stripe size for a database, I
> realized it would be handy to know what the average request size is.
> Getting this info is a simple matter of joining pg_stat_all_tables and
> pg_statio_all_tables and doing some math, but there's one issue I've
> found; it appears that there's no information on how many heap blocks
> were read in by an index scan. Is there any way to get that info?

RAID usually doesn't work the way most people think. ;)

Not sure how well you know RAID, so I'm just mentioning some points just
in case, and for the archives.

If your average request is for 16K, and you choose a 16K stripe size,
then that means half your requests (assuming a normal bell curve) would
be larger than a single stripe, and you've just succeeded in having half
your requests require seeks on two spindles instead of one. If that's
done sequentially, you're set for less than half the performance of a
flat disk.

Knowing what the average request size is can be a good place to start,
but the real question is: which stripe size will allow the majority of
your requests to be satisfied without having to go to two spindles?

I've actually had good success with 2MB stripe sizes using software
RAID. If the reads are fairly well distributed, all the drives are hit
equally, and very few small requests have to go to two spindles.

Read speeds from modern drives are fast. It's usually the seeks that
kill performance, so making sure you reduce the number of seeks should
almost always be the priority.

That said, it's the transactions against disk that typically matter. On
FreeBSD, you can get an impression of this using 'systat -vmstat', and
watch the KB/t column for your drives.

A seek will take some time; the head has to settle down, find the right
place to start reading, etc., so a seek will always take time.
A seek over a longer distance takes more time though, so even if your
transactions are pretty small, using a large stripe size can be a good
thing if you have lots of small transactions that are close by. The head
will be in the area, reducing seek time.

This all depends on what types of load you have, and it's hard to
generalize too much about what makes things fast. As always, it pretty
much boils down to trying things while running as close to production
load as you can.

Terje
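[Editor's note: Terje's point about request size versus stripe size can be
made concrete with a little arithmetic. This is a back-of-the-envelope
sketch, not from the thread: assuming a request of r bytes starts at a
uniformly random offset within a stripe of s bytes, the chance it spans a
stripe boundary — and so touches two spindles — is roughly min(1, r/s).]

```python
def crossing_probability(request_bytes, stripe_bytes):
    """Approximate probability that a request starting at a uniformly
    random offset within one stripe spills into the next stripe
    (and therefore needs a second spindle)."""
    return min(1.0, request_bytes / stripe_bytes)

# A 16 KB request on a 16 KB stripe essentially always spans two stripes...
print(crossing_probability(16 * 1024, 16 * 1024))        # → 1.0
# ...while on a 2 MB stripe it almost never does.
print(crossing_probability(16 * 1024, 2 * 1024 * 1024))  # → 0.0078125
```

This is the intuition behind Terje's success with 2MB stripes: a large
stripe makes small requests single-spindle with overwhelming probability.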
"Jim C. Nasby" <jnasby@pervasive.com> writes: > In my case it would be helpful to break the heap access numbers out > between seqscans and index scans, since each of those represents very > different access patterns. Would adding that be a mess? Yes; it'd require more counters-per-table than we now keep, thus nontrivial bloat in the stats collector's tables. Not to mention incompatible changes in the pgstats views and the underlying functions (which some apps probably use directly). regards, tom lane
On Fri, Apr 14, 2006 at 11:12:55AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > In my case it would be helpful to break the heap access numbers out
> > between seqscans and index scans, since each of those represents very
> > different access patterns. Would adding that be a mess?
>
> Yes; it'd require more counters-per-table than we now keep, thus
> nontrivial bloat in the stats collector's tables. Not to mention

ISTM it would only require two additional columns, which doesn't seem
unreasonable, especially considering the value of the information
collected.

> incompatible changes in the pgstats views and the underlying functions
> (which some apps probably use directly).

There are certainly ways around that issue, especially since this would
only be adding new information (though we would probably want to
consider the old info deprecated and eventually remove it).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
On Fri, Apr 14, 2006 at 08:05:39AM +0200, Terje Elde wrote:
> Jim Nasby wrote:
> > While working on determining a good stripe size for a database, I
> > realized it would be handy to know what the average request size is.
> > Getting this info is a simple matter of joining pg_stat_all_tables and
> > pg_statio_all_tables and doing some math, but there's one issue I've
> > found; it appears that there's no information on how many heap blocks
> > were read in by an index scan. Is there any way to get that info?
<snip>
> Knowing what the average request size is can be a good place to start,
> but the real question is: which stripe size will allow the majority of
> your requests to be satisfied without having to go to two spindles?

And of course right now there's not a very good way to know that...
Granted, I can look at the average request size on the machine, but that
will include any seqscans that are happening, and for stripe sizing I
think it's better to leave those out of the picture unless your workload
is heavily based on seqscans.

> That said, it's the transactions against disk that typically matter. On
> FreeBSD, you can get an impression of this using 'systat -vmstat', and
> watch the KB/t column for your drives.

On a related note, do you know of any way to determine the breakdown
between read activity and write activity on FreeBSD? vmstat, systat, and
iostat all only return aggregate info. :(
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Jim C. Nasby wrote:
>> That said, it's the transactions against disk that typically matter. On
>> FreeBSD, you can get an impression of this using 'systat -vmstat', and
>> watch the KB/t column for your drives.
>
> On a related note, do you know of any way to determine the breakdown
> between read activity and write activity on FreeBSD? vmstat, systat,
> iostat all only return aggregate info. :(

Can't think of a right way to do this ATM, but for a lab-type setup to
get an idea, you could set up a gmirror volume, then choose a balancing
algorithm that only reads from one of the disks. The effect should be
that writes go to both, while reads only go to one. Activity on the
write-only disk would give you an idea of the write activity, and
(read/write disk - write-only disk) would give you an idea of the reads.

I have to admit though, it seems like quite a bit of hassle, and I'm not
sure how good the numbers would be: given that at least some of the info
(KB/transaction) is an aggregate, it'd require a bit of math to get
decent numbers. But at least it's something.

Terje
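[Editor's note: the arithmetic behind the gmirror trick can be sketched as
follows. This is a hypothetical illustration with made-up counter names;
the real numbers would come from per-disk transfer stats on the two
gmirror members.]

```python
def split_reads_writes(read_write_disk_xfers, write_only_disk_xfers):
    """Estimate the read/write breakdown from a two-disk gmirror where
    one disk is configured (via the balance algorithm) to receive
    writes only.

    Writes hit both disks, so the write-only disk's activity counts
    writes alone; the other disk sees reads plus writes, so the reads
    fall out by subtraction."""
    writes = write_only_disk_xfers
    reads = read_write_disk_xfers - write_only_disk_xfers
    return reads, writes

# Example: the read/write disk saw 5,000 transfers, the write-only 1,200.
print(split_reads_writes(5000, 1200))  # → (3800, 1200)
```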
On Wed, Apr 19, 2006 at 04:35:11AM +0200, Terje Elde wrote:
> Can't think of a right way to do this ATM, but for a lab-type setup to
> get an idea, you could set up a gmirror volume, then choose a balancing
> algorithm to only read from one of the disks. The effect should be that
> writes go to both, while reads only go to one. Activity on the
> write-only disk would give you an idea of the write activity, and
> (read/write disk - write-only disk) would give you an idea of the
> reads. I have to admit though, seems like quite a bit of hassle, and
> I'm not sure how good the numbers would be, given that at least some of
> the info (KB/transaction) are totals, it'd require a bit of math to get
> decent numbers. But at least it's something.

Yeah... not gonna happen...

It's completely mind-boggling that FBSD doesn't track writes and reads
separately.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Jim C. Nasby wrote:
> Yeah... not gonna happen...
>
> It's completely mind-boggling that FBSD doesn't track writes and reads
> separately.

'iostat' does not tell you this, but 'gstat' does - it's the "geom"
system monitor (a bit annoying that the standard tool is lacking in this
regard...).

Cheers

Mark