Thread: [Fwd: Re: 8192 BLCKSZ ?]

[Fwd: Re: 8192 BLCKSZ ?]

From
mlw
Date:
Tom Samplonius wrote:

> On Tue, 28 Nov 2000, mlw wrote:
>
> > Tom Samplonius wrote:
> > >
> > > On Mon, 27 Nov 2000, mlw wrote:
> > >
> > > > This is just a curiosity.
> > > >
> > > > Why is the default postgres block size 8192? These days, with caching
> > > > file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
> > > > even gigabytes. Surely, 8K is inefficient.
> > >
> > >   I think it is a pretty wild assumption to say that 32k is more efficient
> > > than 8k.  Considering how blocks are used, 32k may be in fact quite a bit
> > > slower than 8k blocks.
> >
> > I'm not so sure I agree. Perhaps I am off base here, but I did a bit of
> > OS profiling a while back when I was doing a DICOM server. I
> > experimented with block sizes and found that the best throughput on
> > Linux and Windows NT was at 32K. The graph I created showed a steady
> > increase in performance and a drop just after 32K, then steady from
> > there. In Windows NT it was more pronounced than it was in Linux, but
> > Linux still exhibited a similar trait.
>
>   You are a bit off base here.  The typical access pattern is random IO,
> not sequential.  If you use a large block size in Postgres, Postgres
> will read and write more data than necessary.  Which is faster? 1000 x 8K
> IOs?  Or 1000 x 32K IOs?

I can sort of see your point, but 8K vs 32K is not a linear relationship.
The big hit is the disk I/O operation itself, more so than the amount of
data. It may be almost as efficient to write 32K as it is to write 8K.
While I do not know the exact numbers, and it varies by OS and disk
subsystem, I am sure that writing 32K is not even close to 4x more
expensive than writing 8K. Think about seek times: writing anything to
the disk is expensive regardless of the amount of data. Most disks today
have many heads and are RLL encoded. The extra data may only add 10us
(approx. 1-2 sectors of a 64-sector drive spinning at 7200 rpm) to a disk
operation which takes an order of magnitude longer positioning the heads.

The overhead of an additional 24K is minute compared to the cost of a
disk operation. So if any measurable benefit can come from having bigger
buffers, i.e. having more data available per disk operation, it will
probably be faster.
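
To put rough numbers on that, here is a small back-of-the-envelope sketch.
The seek, rotational, and transfer figures are assumed values for a 7200
rpm disk of this era, not measurements:

    # Back-of-the-envelope model of one random disk read.  All figures are
    # assumptions for a 7200 rpm drive of this era, not measurements.
    SEEK_MS = 8.5                        # assumed average seek time
    ROTATE_MS = 60000.0 / 7200 / 2       # half a revolution at 7200 rpm
    TRANSFER_MB_PER_S = 20.0             # assumed sustained media transfer rate

    def io_time_ms(block_bytes):
        # positioning cost plus the time to move the block off the platter
        transfer_ms = block_bytes / (TRANSFER_MB_PER_S * 1024 * 1024) * 1000
        return SEEK_MS + ROTATE_MS + transfer_ms

    for size in (8 * 1024, 32 * 1024):
        print(f"{size // 1024:2d}K random read: {io_time_ms(size):.2f} ms")

    # With these assumptions the extra 24K adds roughly a millisecond to an
    # operation dominated by seek and rotational latency.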


Re: [Fwd: Re: 8192 BLCKSZ ?]

From
mlw
Date:
Kevin O'Gorman wrote:
> 
> mlw wrote:
> >
> > Tom Samplonius wrote:
> >
> > > On Tue, 28 Nov 2000, mlw wrote:
> > >
> > > > Tom Samplonius wrote:
> > > > >
> > > > > On Mon, 27 Nov 2000, mlw wrote:
> > > > >
> > > > > > This is just a curiosity.
> > > > > >
> > > > > > Why is the default postgres block size 8192? These days, with caching
> > > > > > file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
> > > > > > even gigabytes. Surely, 8K is inefficient.
> > > > >
> > > > >   I think it is a pretty wild assumption to say that 32k is more efficient
> > > > > than 8k.  Considering how blocks are used, 32k may be in fact quite a bit
> > > > > slower than 8k blocks.
> > > >
> > > > I'm not so sure I agree. Perhaps I am off base here, but I did a bit of
> > > > OS profiling a while back when I was doing a DICOM server. I
> > > > experimented with block sizes and found that the best throughput on
> > > > Linux and Windows NT was at 32K. The graph I created showed a steady
> > > > increase in performance and a drop just after 32K, then steady from
> > > > there. In Windows NT it was more pronounced than it was in Linux, but
> > > > Linux still exhibited a similar trait.
> > >
> > >   You are a bit off base here.  The typical access pattern is random IO,
> > > not sequential.  If you use a large block size in Postgres, Postgres
> > > will read and write more data than necessary.  Which is faster? 1000 x 8K
> > > IOs?  Or 1000 x 32K IOs?
> >
> > I can sort of see your point, but 8K vs 32K is not a linear relationship.
> > The big hit is the disk I/O operation itself, more so than the amount of
> > data. It may be almost as efficient to write 32K as it is to write 8K.
> > While I do not know the exact numbers, and it varies by OS and disk
> > subsystem, I am sure that writing 32K is not even close to 4x more
> > expensive than writing 8K. Think about seek times: writing anything to
> > the disk is expensive regardless of the amount of data. Most disks today
> > have many heads and are RLL encoded. The extra data may only add 10us
> > (approx. 1-2 sectors of a 64-sector drive spinning at 7200 rpm) to a disk
> > operation which takes an order of magnitude longer positioning the heads.
> >
> > The overhead of an additional 24K is minute compared to the cost of a
> > disk operation. So if any measurable benefit can come from having bigger
> > buffers, i.e. having more data available per disk operation, it will
> > probably be faster.
> 
> This is only part of the story.  It applies best when you're going
> to use sequential scans, for instance, or otherwise use all the info
> in any block that you fetch.  However, when your blocks are 8x bigger,
> your number of blocks in the disk cache is 8x fewer.  If you're
> accessing random blocks, your hopes of finding the block in the
> cache are affected (probably not 8x, but there is an effect).
> 
> So don't just blindly think that bigger blocks are better.  It
> ain't necessarily so.
> 

First, the difference between 8K and 32K is a factor of 4, not 8.

The problem is you are looking at these numbers as if there is a linear
relationship between the 8 and the 32. You are thinking 8 is 1/4 the
size of 32, so it must be 1/4 the amount of work. This is not true at
all.

Many operating systems use a fixed memory block size allocation for
their disk cache. They do not allocate a new block for every disk
request; they maintain a pool of fixed-size buffer blocks. So if you
use fewer bytes than the OS block size, you waste the difference between
your block size and the block size of the OS cache entry.

I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
pretty confident that NT does as well from my previous tests.

So, in effect, an 8K block may waste 3/4 of the memory in the disk
cache.
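
A quick sketch of that waste arithmetic, under the fixed-32K-slot
assumption made here (an assumption that is corrected later in this
thread); the 64 MB cache size is just an illustration value:

    # Waste arithmetic under the ASSUMPTION of fixed 32K cache slots;
    # later messages in this thread correct this assumption for Linux.
    SLOT = 32 * 1024              # assumed fixed cache slot size
    DB_BLOCK = 8 * 1024           # PostgreSQL default block size
    CACHE = 64 * 1024 * 1024      # illustrative 64 MB disk cache

    slots = CACHE // SLOT                       # 2048 slots
    useful = slots * DB_BLOCK                   # database bytes actually cached
    print(f"useful data: {useful // 2**20} MB of a {CACHE // 2**20} MB cache")
    print(f"wasted per slot: {SLOT - DB_BLOCK} bytes ({1 - DB_BLOCK / SLOT:.0%})")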

http://www.mohawksoft.com


Re: [Fwd: Re: 8192 BLCKSZ ?]

From
mlw
Date:
Kevin O'Gorman wrote:
> 
> mlw wrote:
> > Many operating systems use a fixed memory block size allocation for
> > their disk cache. They do not allocate a new block for every disk
> > request, they maintain a pool of fixed sized buffer blocks. So if you
> > use fewer bytes than the OS block size you waste the difference between
> > your block size and the block size of the OS cache entry.
> >
> > I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
> > pretty confident that NT does as well from my previous tests.
> 
> I dunno about NT, but here's a quote from "Linux Kernel Internals"
> 2nd Ed, page 92-93:
>     .. The block size for any given device may be 512, 1024, 2048 or
>     4096 bytes....
> 
>     ... the buffer cache manages individual block buffers of
>     varying size.  For this, every block is given a 'buffer_head' data
>     structure. ...  The definition of the buffer head is in linux/fs.h
> 
>     ... the size of this area exactly matches the block size 'b_size'...
> 
> The quote goes on to describe how the data structures are designed to
> be processor-cache-aware.
> 

I double checked the kernel source, and you are right. I stand corrected
about the disk caching.

My assertion stands: it is a negligible difference to read 32K vs 8K
from a disk, and the probability of data being within a 4 times larger
block is 4 times better, even though the probability of having the
correct block in memory is 4 times less. So, I don't think it is a
numerically significant issue.
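
A toy LRU cache simulation illustrates the offsetting effects for
uniformly random access. All sizes are arbitrary illustration values,
not a model of the real PostgreSQL or OS caches, and workloads with
locality can behave quite differently:

    # Toy LRU block cache fed uniformly random row lookups: the hit rate is
    # governed by cache_bytes / table_bytes, nearly independent of block size.
    import random
    from collections import OrderedDict

    def hit_rate(block_size, table_bytes=64 * 1024 * 1024,
                 cache_bytes=8 * 1024 * 1024, row_size=128, accesses=200000):
        rows = table_bytes // row_size
        slots = cache_bytes // block_size      # fewer, larger slots as blocks grow
        cache = OrderedDict()                  # block number -> None, in LRU order
        hits = 0
        random.seed(1)
        for _ in range(accesses):
            block = (random.randrange(rows) * row_size) // block_size
            if block in cache:
                hits += 1
                cache.move_to_end(block)
            else:
                cache[block] = None
                if len(cache) > slots:
                    cache.popitem(last=False)  # evict the least recently used block
        return hits / accesses

    for bs in (8 * 1024, 32 * 1024):
        print(f"{bs // 1024:2d}K blocks: hit rate {hit_rate(bs):.3f}")
    # Both land near cache_bytes / table_bytes = 0.125 for this access pattern.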


-- 
http://www.mohawksoft.com


Re: [Fwd: Re: 8192 BLCKSZ ?]

From
mlw
Date:
Kevin O'Gorman wrote:
> 
> mlw wrote:
> >
> > Kevin O'Gorman wrote:
> > >
> > > mlw wrote:
> > > > Many operating systems use a fixed memory block size allocation for
> > > > their disk cache. They do not allocate a new block for every disk
> > > > request, they maintain a pool of fixed sized buffer blocks. So if you
> > > > use fewer bytes than the OS block size you waste the difference between
> > > > your block size and the block size of the OS cache entry.
> > > >
> > > > I'm pretty sure Linux uses a 32K buffer size in its cache, and I'm
> > > > pretty confident that NT does as well from my previous tests.
> > >
> > > I dunno about NT, but here's a quote from "Linux Kernel Internals"
> > > 2nd Ed, page 92-93:
> > >     .. The block size for any given device may be 512, 1024, 2048 or
> > >     4096 bytes....
> > >
> > >     ... the buffer cache manages individual block buffers of
> > >     varying size.  For this, every block is given a 'buffer_head' data
> > >     structure. ...  The definition of the buffer head is in linux/fs.h
> > >
> > >     ... the size of this area exactly matches the block size 'b_size'...
> > >
> > > The quote goes on to describe how the data structures are designed to
> > > be processor-cache-aware.
> > >
> >
> > I double checked the kernel source, and you are right. I stand corrected
> > about the disk caching.
> >
> > My assertion stands: it is a negligible difference to read 32K vs 8K
> > from a disk, and the probability of data being within a 4 times larger
> > block is 4 times better, even though the probability of having the
> > correct block in memory is 4 times less. So, I don't think it is a
> > numerically significant issue.
> >
> 
> My point is that it's going to depend strongly on what you're doing.
> If you're getting only one item from each block, you pay a cost in cache
> flushing even if the disk I/O time isn't much different.  You're carrying
> 3x unused bytes and displacing other, possibly useful, things from the
> cache.
> 
> So whether it's a good thing or not is something you have to measure, not
> argue about.  Because it will vary depending on your workload.  That's
> where a DBA begins to earn his/her pay.

I would tend to disagree "in general." One can always find better ways
to search data if one knows the nature of the data and the nature of the
search beforehand. That knowledge could be whether the data is sorted
along the lines of the type of search you want to do, or knowledge of
the entirety of the data, and so on.

The cost difference between 32K and 8K disk reads/writes is so small
these days, when compared with the overall cost of the disk operation
itself, that you cannot even measure it; it is well below 1%. Remember,
seek times advertised on disks are an average.

SQL itself is a compromise between a hand-coded search program and a
general-purpose solution. For a general-purpose search system, one cannot
conclude that data is less likely to be found in a larger block than in a
smaller block that happens to remain in cache.

There are just as many cases where one could make an argument for one
versus the other based on the nature of the data and the nature of the
search.

However, that being said, memory DIMMs are 256M for $100 and time is
priceless. The 8K default has been there as long as I can remember having
to think about it, and only recently did I learn it can be changed. I
have been using Postgres since about 1996.

I argue that reading or writing 32K to disk is, for all practical
purposes, not measurably different from 8K. The sole point in your
argument is that with a 4x larger block you have 1/4 the chance that a
given block will be in memory.

I argue that with a 4x greater block size, you have 4x greater chance
that data will be in a block, and that this offsets the 1/4 chance of
something being in cache. 

The likelihood of something being in a cache is roughly proportional to
the ratio of the size of the cache to the size of the whole object being
cached, and depends on the algorithms used to decide what remains in
cache. Typically this is a combination of LRU, access frequency, and some
predictive analysis.

Small databases may, in fact, reside entirely in the disk cache because
of the amount of RAM on modern machines. Large databases cannot be
entirely cached, and only some small percentage of them will be in cache.
Depending on the "randomness" of the search criteria, the probability
that the item you wish to locate is in cache has, as far as I can see,
little to do with the block size.

I am going to try to get some time together this weekend and see whether
the benchmark programs measure a difference between block sizes, and if
so, compare them. I will try to test 8K, 16K, 24K, and 32K.
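
A crude sketch of such a measurement: time a fixed number of random
reads at each block size from a large test file. The file name, file
size, and read count below are arbitrary, and the results only mean
something if the file is much larger than RAM (or the OS cache is
flushed between runs); otherwise you are timing the page cache:

    import os
    import random
    import time

    TEST_FILE = "blcksz_test.dat"          # hypothetical scratch file name
    FILE_SIZE = 512 * 1024 * 1024          # 512 MB of test data (arbitrary)
    READS = 2000

    def make_file():
        # Write real data so reads hit the disk; a sparse file would not.
        chunk = os.urandom(1024 * 1024)
        with open(TEST_FILE, "wb") as f:
            for _ in range(FILE_SIZE // len(chunk)):
                f.write(chunk)

    def bench(block_size):
        fd = os.open(TEST_FILE, os.O_RDONLY)
        random.seed(42)                    # reproducible offsets
        start = time.time()
        for _ in range(READS):
            offset = random.randrange(FILE_SIZE // block_size) * block_size
            os.lseek(fd, offset, os.SEEK_SET)
            os.read(fd, block_size)
        os.close(fd)
        return time.time() - start

    if __name__ == "__main__":
        make_file()
        for kb in (8, 16, 24, 32):
            print(f"{kb:2d}K random reads: {bench(kb * 1024):.2f} s")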

-- 
http://www.mohawksoft.com


RE: 8192 BLCKSZ ?]

From
"Andrew Snow"
Date:

> The cost difference between 32K and 8K disk reads/writes is so small
> these days, when compared with the overall cost of the disk operation
> itself, that you cannot even measure it; it is well below 1%. Remember,
> seek times advertised on disks are an average.

It has been said how small the difference is - therefore in my opinion it
should remain at 8KB to maintain best average performance with all existing
platforms.

I say it's best to let the OS and mass storage subsystem worry about
read-ahead caching, and whether they actually read 8KB off the disk, or
32KB or 64KB, when we ask for 8.


- Andrew




RE: 8192 BLCKSZ ?]

From
Don Baccus
Date:
At 10:52 AM 12/2/00 +1100, Andrew Snow wrote:
>
>
>> The cost difference between 32K and 8K disk reads/writes is so small
>> these days, when compared with the overall cost of the disk operation
>> itself, that you cannot even measure it; it is well below 1%. Remember,
>> seek times advertised on disks are an average.
>
>It has been said how small the difference is - therefore in my opinion it
>should remain at 8KB to maintain best average performance with all existing
>platforms.

With versions <= PG 7.0, the motivation that's been stated isn't performance
based as much as an option to let you stick relatively big chunks of text
(~40k-ish+ for lzText) in a single row without resorting to classic PG's
ugly LOB interface or something almost as ugly as the built-in LOB handler
I did for AOLserver many months ago.  The performance arguments have mostly
been of the form "it won't really cost you much and you can use rows that
are so much longer ..."

I think there's been recognition that 8KB is a reasonable default, along with
lamenting (at least on my part) that the fact that this is just a DEFAULT
hasn't been well communicated, leading many casual surveyors of DB
alternatives to believe that it is truly a hard-wired limitation, and causing
PG's reputation to suffer as a result.  One could argue that PG's reputation
would've been enhanced in past years if the 32KB block size limit, rather
than the 8KB default, had been emphasized.

But you wouldn't have to change the DEFAULT in order to make this claim!  It
would've been just a matter of emphasizing the limit rather than the default.

PG 7.1 will pretty much end any confusion.  The segmented approach used by
TOAST should work well (the AOLserver LOB handler I wrote months ago works
well in the OpenACS context, and uses a very similar segmentation scheme, so
I expect TOAST to work even better).  Users will still be able to change to
larger blocksizes (perhaps a wise thing to do if a large percentage of their
data won't fit into a single PG block).   Users using the default will
be able to store rows of *awesome* length, efficiently.




- Don Baccus, Portland OR <dhogaza@pacifier.com> Nature photos, on-line guides, Pacific Northwest Rare Bird Alert
Service and other goodies at http://donb.photo.net.
 


Re: 8192 BLCKSZ ?]

From
Jan Wieck
Date:
Don Baccus wrote:
>
> ...
> I expect TOAST to work even better).  Users will still be able to change to
> larger blocksizes (perhaps a wise thing to do if a large percentage of their
> data won't fit into a single PG block).   Users using the default will
> be able to store rows of *awesome* length, efficiently.
   Depends...

   Actually the toaster already jumps in if your tuples exceed BLCKSZ/4,
   so with the default of 8K blocks it tries to keep all tuples smaller
   than 2K. The reasons behind that are:

   1.  An average tuple size of 8K means an average of 4K unused space
       at the end of each block. Wasting space means wasting IO
       bandwidth.

   2.  Since big items are unlikely to be search criteria, needing to
       read them into memory for every check for a match on other
       columns is a waste again. So the more big items are moved off
       from the main tuple, the smaller the main table becomes, the more
       likely it is that the main tuples (holding the keys) are cached,
       and the cheaper a sequential scan becomes.

   Of course, especially for 2. there is a break-even point. That is
   when the extra fetches to send toast values to the client cost more
   than was saved by not doing it during the main scan already. A full
   table SELECT * definitely costs more if TOAST is involved. But who
   does an unqualified SELECT * from a multi-gig table without problems
   anyway? Usually you pick a single row or a few based on some other
   key attributes, don't you?

   Let's make an example. You have a forum server that displays one
   article plus the date and sender of all follow-ups. The article
   bodies are usually big (1-10K). So you do a SELECT * to fetch the
   actually displayed article, and another SELECT sender, date_sent just
   to get the info for the follow-ups. If we assume a uniform
   distribution of body sizes and an average of 10 follow-ups, that'd
   mean we save 52K of IO and cache usage for each article displayed.
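
   A rough reconstruction of that estimate under the stated assumptions
   (bodies uniform between 1K and 10K, the ~2K threshold, 10 follow-ups
   whose bodies are never fetched); this is sketch arithmetic only, not
   an exact model of the toaster:

       # Sketch arithmetic for the forum example above, using the stated
       # assumptions; the exact savings depend on the toaster's details.
       KB = 1024
       LOW, HIGH = 1 * KB, 10 * KB      # assumed uniform body size range
       THRESHOLD = 2 * KB               # BLCKSZ/4 with the default 8K blocks
       FOLLOWUPS = 10

       p_toasted = (HIGH - THRESHOLD) / (HIGH - LOW)   # share of bodies moved out of line
       avg_toasted = (THRESHOLD + HIGH) / 2            # mean size of a toasted body

       saved = FOLLOWUPS * p_toasted * avg_toasted
       print(f"IO and cache usage avoided per displayed article: ~{saved / KB:.0f}K")
       # ~53K with these round numbers, in the same ballpark as the 52K above.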
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #