Thread: 8192 BLCKSZ ?

8192 BLCKSZ ?

From
mlw
Date:
This is just a curiosity.

Why is the default postgres block size 8192? These days we have caching
file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
maybe even gigabytes. Surely 8K is inefficient.

Has anyone done any tests to see if a default 32K block would provide
better overall performance? 8K seems so small, and 32K looks to be where
most x86 operating systems have their sweet spot.

If someone has the answer off the top of their head, and I'm just being
stupid, let me have it. However, I have needed to raise the block size to
32K for a text management system and have seen no performance problems.
(It has not been a scientific experiment, admittedly.)

This isn't a rant, but my gut tells me that a 32K block size would be a
better default, and that smaller deployments should adjust down as
needed.
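
For reference, the change itself is a single build-time constant. A minimal
sketch follows; the exact header location and the upper limit vary by
PostgreSQL version, so treat the path and comment below as assumptions, and
remember that changing BLCKSZ requires a rebuild and a fresh initdb:

    /* src/include/config.h -- location varies by PostgreSQL version */
    /* Size of a disk block: must be a power of 2 and, in these releases,
     * no larger than 32768. */
    #define BLCKSZ 32768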


Re: 8192 BLCKSZ ?

From
"Mitch Vincent"
Date:
I've been using a 32k BLCKSZ for months now without any trouble, though I've
not benchmarked it to see if it's any faster than one with a BLCKSZ of 8k..

-Mitch

> This is just a curiosity.
>
> Why is the default postgres block size 8192? These days we have caching
> file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
> maybe even gigabytes. Surely 8K is inefficient.
>
> Has anyone done any tests to see if a default 32K block would provide
> better overall performance? 8K seems so small, and 32K looks to be where
> most x86 operating systems have their sweet spot.
>
> If someone has the answer off the top of their head, and I'm just being
> stupid, let me have it. However, I have needed to raise the block size to
> 32K for a text management system and have seen no performance problems.
> (It has not been a scientific experiment, admittedly.)
>
> This isn't a rant, but my gut tells me that a 32K block size would be a
> better default, and that smaller deployments should adjust down as
> needed.
>



RE: 8192 BLCKSZ ?

From
"Christopher Kings-Lynne"
Date:
I don't believe it's a performance issue, I believe it's that writes to
blocks greater than 8k cannot be guaranteed 'atomic' by the operating
system.  Hence, 32k blocks would break the transactions system.  (Or
something like that - am I correct?)

Chris

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Mitch Vincent
> Sent: Tuesday, November 28, 2000 8:40 AM
> To: mlw; Hackers List
> Subject: Re: [HACKERS] 8192 BLCKSZ ?
>
>
> I've been using a 32k BLCKSZ for months now without any trouble,
> though I've
> not benchmarked it to see if it's any faster than one with a
> BLCKSZ of 8k..
>
> -Mitch
>
> > This is just a curiosity.
> >
> > Why is the default postgres block size 8192? These days we have caching
> > file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
> > maybe even gigabytes. Surely 8K is inefficient.
> >
> > Has anyone done any tests to see if a default 32K block would provide
> > better overall performance? 8K seems so small, and 32K looks to be where
> > most x86 operating systems have their sweet spot.
> >
> > If someone has the answer off the top of their head, and I'm just being
> > stupid, let me have it. However, I have needed to raise the block size to
> > 32K for a text management system and have seen no performance problems.
> > (It has not been a scientific experiment, admittedly.)
> >
> > This isn't a rant, but my gut tells me that a 32K block size would be a
> > better default, and that smaller deployments should adjust down as
> > needed.
> >
>



Re: 8192 BLCKSZ ?

From
"Mitch Vincent"
Date:
If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
box this database is running on gets hit pretty hard and I haven't had a
single ounce of trouble since I went to 7.0.X

-Mitch

----- Original Message -----
From: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>
To: "Hackers List" <pgsql-hackers@postgresql.org>
Sent: Monday, November 27, 2000 5:14 PM
Subject: RE: [HACKERS] 8192 BLCKSZ ?


> I don't believe it's a performance issue, I believe it's that writes to
> blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> system.  Hence, 32k blocks would break the transactions system.  (Or
> something like that - am I correct?)
>
> Chris
>
> > -----Original Message-----
> > From: pgsql-hackers-owner@postgresql.org
> > [mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Mitch Vincent
> > Sent: Tuesday, November 28, 2000 8:40 AM
> > To: mlw; Hackers List
> > Subject: Re: [HACKERS] 8192 BLCKSZ ?
> >
> >
> > I've been using a 32k BLCKSZ for months now without any trouble,
> > though I've
> > not benchmarked it to see if it's any faster than one with a
> > BLCKSZ of 8k..
> >
> > -Mitch
> >
> > > This is just a curiosity.
> > >
> > > Why is the default postgres block size 8192? These days we have caching
> > > file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
> > > maybe even gigabytes. Surely 8K is inefficient.
> > >
> > > Has anyone done any tests to see if a default 32K block would provide
> > > better overall performance? 8K seems so small, and 32K looks to be where
> > > most x86 operating systems have their sweet spot.
> > >
> > > If someone has the answer off the top of their head, and I'm just being
> > > stupid, let me have it. However, I have needed to raise the block size to
> > > 32K for a text management system and have seen no performance problems.
> > > (It has not been a scientific experiment, admittedly.)
> > >
> > > This isn't a rant, but my gut tells me that a 32K block size would be a
> > > better default, and that smaller deployments should adjust down as
> > > needed.
> > >
> >
>
>



Re: 8192 BLCKSZ ?

From
Bruce Momjian
Date:
> If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
> box this database is running on gets hit pretty hard and I haven't had a
> single ounce of trouble since I went to 7.0.X

Larger block sizes mean larger blocks in the cache, therefore fewer
blocks per megabyte.  The more granular the cache, the better.

8k is the standard Unix file system disk transfer size.  Anything smaller
and we incur the overhead of the kernel transferring more data than we
actually retrieve; anything larger and the cache becomes less granular.

No transaction issues because we use fsync.
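
For anyone unfamiliar with the call, here is a minimal sketch of the
write-then-fsync pattern referred to above. It is illustrative only, not
PostgreSQL source (the file name is made up), and note that fsync forces the
data out to the drive but does not by itself make a multi-sector write atomic:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int main(void)
    {
        char block[BLCKSZ];
        memset(block, 0, sizeof block);

        int fd = open("datafile", O_WRONLY | O_CREAT, 0600); /* made-up name */
        if (fd < 0)
            return 1;
        if (write(fd, block, sizeof block) != (ssize_t) sizeof block)
            return 1;
        /* fsync() does not return until the kernel has handed the data to
         * the drive; whether the drive's own write cache honors that is a
         * separate question (see later messages in this thread). */
        if (fsync(fd) != 0)
            return 1;
        close(fd);
        return 0;
    }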

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: 8192 BLCKSZ ?

From
Nathan Myers
Date:
Nothing is guaranteed for anything larger than 512 bytes, and even
then you have maybe a 1e-13 likelihood that a block badly written
during a power outage goes unnoticed.  (That is why the FAQ recommends
you invest in a UPS.)  If PG crashes, you're covered, regardless of 
block size.  If the OS crashes, you're not.  If the power goes out, 
you're not.

The block size affects how much is written when you change only a 
single record within a block.  When you update a two-byte field in a 
100-byte record, do you want to write 32k?  (The answer is "maybe".)

Nathan Myers
ncm@zembu.com

On Tue, Nov 28, 2000 at 09:14:15AM +0800, Christopher Kings-Lynne wrote:
> I don't believe it's a performance issue, I believe it's that writes to
> blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> system.  Hence, 32k blocks would break the transactions system.  (Or
> something like that - am I correct?)
> 
> > From: pgsql-hackers-owner@postgresql.org <On Behalf Of Mitch Vincent>
> > Sent: Tuesday, November 28, 2000 8:40 AM
> > Subject: Re: [HACKERS] 8192 BLCKSZ ?
> >
> > I've been using a 32k BLCKSZ for months now without any trouble,
> > though I've
> > not benchmarked it to see if it's any faster than one with a
> > BLCKSZ of 8k..
> >
> > > This is just a curiosity.
> > >
> > > Why is the default postgres block size 8192? These days we have caching
> > > file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
> > > maybe even gigabytes. Surely 8K is inefficient.
> > >
> > > Has anyone done any tests to see if a default 32K block would provide
> > > better overall performance? 8K seems so small, and 32K looks to be where
> > > most x86 operating systems have their sweet spot.
> > >
> > > If someone has the answer off the top of their head, and I'm just being
> > > stupid, let me have it. However, I have needed to raise the block size to
> > > 32K for a text management system and have seen no performance problems.
> > > (It has not been a scientific experiment, admittedly.)
> > >
> > > This isn't a rant, but my gut tells me that a 32K block size would be a
> > > better default, and that smaller deployments should adjust down as
> > > needed.


Re: 8192 BLCKSZ ?

From
Don Baccus
Date:
At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
>> If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
>> box this database is running on gets hit pretty hard and I haven't had a
>> single ounce of trouble since I went to 7.0.X
>
>Larger block sizes mean larger blocks in the cache, therefore fewer
>blocks per megabyte.  The more granular the cache, the better.

Well, true, but when you have 256 MB or a half-gig or more to devote to
the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
pain for a lot of folks.

Though the entire discussion is moot with PG 7.1 and its removal of the
tuple-size limit, it has been unfortunate that the fact that a block size
of up to 32KB can easily be configured at build time hasn't been printed
in a flaming-red oversized font on the front page of www.postgresql.org.

THE ENTIRE WORLD seems to believe that PG suffers from a hard-wired 8KB
limit on tuple size, rather than simply defaulting to that limit.  When
I tell the heathens that the REAL limit is 32KB, they're surprised, amazed,
pleased etc.

This default has unfairly contributed to the poor reputation PG has suffered
for so long, thanks to widespread ignorance of the fact that it's only a
default, and easily changed.

For instance, the November Linux Journal has a column on PG, favorable but
mentioning the 8KB limit as though it were absolute.  Tim Perdue's article on
PHP Builder implied the same when he spoke of PG 7.1 removing the limit.

Again, PG 7.1 removes the issue entirely, but it is ironic that so many
people had heard that PG suffered from a hard-wired 8KB limit on tuple
length...



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: 8192 BLCKSZ ?

From
Bruce Momjian
Date:
> At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
> >> If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the
> >> box this database is running on gets hit pretty hard and I haven't had a
> >> single ounce of trouble since I went to 7.0.X
> >
> >Larger block sizes mean larger blocks in the cache, therefore fewer
> >blocks per megabyte.  The more granular the cache, the better.
> 
> Well, true, but when you have 256 MB or a half-gig or more to devote to
> the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
> pain for a lot of folks.

Agreed.  The other problem is that most people have 2-4MB of cache, so a
32k default would be too big for them.
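
The arithmetic behind that concern, as a throwaway illustration (the 2MB
figure is simply the low end of the range mentioned above):

    #include <stdio.h>

    int main(void)
    {
        const long cache = 2L * 1024 * 1024;                /* 2MB buffer cache */
        printf("8K blocks:  %ld buffers\n", cache / 8192);  /* 256 buffers */
        printf("32K blocks: %ld buffers\n", cache / 32768); /*  64 buffers */
        return 0;
    }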

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: 8192 BLCKSZ ?

From
Don Baccus
Date:
At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:

>> Well, true, but when you have 256 MB or a half-gig or more to devote to
>> the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
>> pain for a lot of folks.
>
>Agreed.  The other problem is that most people have 2-4MB of cache, so a
>32k default would be too big for them.

I've always been fine with the default, and in fact agree with it.  The
OpenACS project recommends a 16KB default for PG 7.0, but that's only so
we can hold reasonable-sized lzText strings in forum tables, etc.

I was only lamenting the fact that the world seems to have the impression
that it's not a default, but rather a hard-wired limit.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: 8192 BLCKSZ ?

From
Tom Lane
Date:
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> I don't believe it's a performance issue, I believe it's that writes to
> blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> system.  Hence, 32k blocks would break the transactions system.

As Nathan remarks nearby, it's hard to tell how big a write can be
assumed atomic, unless you have considerable knowledge of your OS and
hardware.  However, on traditional Unix filesystems (BSD-derived) it's
a pretty certain bet that writes larger than 8K will *not* be atomic,
since 8K is the filesystem block size.  You don't even need any crash
scenario to see why not: just consider running your disk down to zero
free space.  If there's one block left when you try to add a
multi-block page to your table, you are left with a corrupted page,
not an unwritten page.
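
(A minimal sketch of that failure mode, with error handling trimmed and a
made-up file name: write() can report a short count partway through a
multi-block page, which is exactly why such a page cannot be assumed atomic.)

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define PAGESZ 32768

    int main(void)
    {
        static char page[PAGESZ];   /* zero-filled 32K page */
        int fd = open("table.data", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0)
            return 1;

        ssize_t written = write(fd, page, PAGESZ);
        if (written >= 0 && written < PAGESZ)
            fprintf(stderr, "short write: only %zd of %d bytes made it\n",
                    written, PAGESZ);
        close(fd);
        return 0;
    }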

Not sure about the wild-and-wooly world of Linux filesystems...
anybody know what the allocation unit is on the popular Linux FSes?

My feeling is that 8K is an entirely reasonable size now that we have
TOAST, and so there's no longer much interest in changing the default
value of BLCKSZ.

In theory, I think, WAL should reduce the importance of page writes
being atomic --- but it still seems like a good idea to ensure that
they are as atomic as we can make them.
        regards, tom lane


Re: 8192 BLCKSZ ?

From
Bruce Momjian
Date:
> At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:
> 
> >> Well, true, but when you have 256 MB or a half-gig or more to devote to
> >> the cache, you get plenty of blocks, and in pre-PG 7.1 the 8KB limit is a
> >> pain for a lot of folks.
> >
> >Agreed.  The other problem is that most people have 2-4MB of cache, so a
> >32k default would be too big for them.
> 
> I've always been fine with the default, and in fact agree with it.  The
> OpenACS project recommends a 16KB default for PG 7.0, but that's only so
> we can hold reasonable-sized lzText strings in forum tables, etc.
> 
> I was only lamenting the fact that the world seems to have the impression
> that it's not a default, but rather a hard-wired limit.

Agreed.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: 8192 BLCKSZ ?

From
Tom Samplonius
Date:
On Mon, 27 Nov 2000, mlw wrote:

> This is just a curiosity.
> 
> Why is the default postgres block size 8192? These days we have caching
> file systems, high-speed DMA disks, and hundreds of megabytes of RAM,
> maybe even gigabytes. Surely 8K is inefficient.

I think it is a pretty wild assumption to say that 32k is more efficient
than 8k.  Considering how blocks are used, 32k may in fact be quite a bit
slower than 8k blocks.


Tom



Re: 8192 BLCKSZ ?

From
Bruce Guenter
Date:
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> Not sure about the wild-and-wooly world of Linux filesystems...
> anybody know what the allocation unit is on the popular Linux FSes?

It rather depends on the filesystem.  Current ext2 (the most common)
systems default to 1K on small partitions and 4K otherwise.  IIRC,
reiserfs uses 4K blocks in a tree structure that includes tail merging,
which makes the question of block size tricky.  Linux 2.3.x passes all
file I/O through its page cache, which deals in 4K pages on most 32-bit
architectures.
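
If you want to see what a given filesystem reports for itself, a quick
sketch (the mount point is hypothetical; f_bsize is the filesystem block
size as reported through statvfs):

    #include <stdio.h>
    #include <sys/statvfs.h>

    int main(void)
    {
        struct statvfs vfs;

        if (statvfs("/usr/local/pgsql/data", &vfs) != 0) { /* hypothetical path */
            perror("statvfs");
            return 1;
        }
        printf("filesystem block size: %lu bytes\n",
               (unsigned long) vfs.f_bsize);
        return 0;
    }
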
--
Bruce Guenter <bruceg@em.ca>                       http://em.ca/~bruceg/

Re: 8192 BLCKSZ ?

From
Nathan Myers
Date:
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> "Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
> > I don't believe it's a performance issue, I believe it's that writes to
> > blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> > system.  Hence, 32k blocks would break the transactions system.
> 
> As Nathan remarks nearby, it's hard to tell how big a write can be
> assumed atomic, unless you have considerable knowledge of your OS and
> hardware.  

Not to harp on the subject, but even if you _do_ know a great deal
about your OS and hardware, you _still_ can't assume any write is
atomic.

To give an idea of what is involved, consider that modern disk 
drives routinely re-order writes, by themselves.  You think you
have asked for a sequential write of 8K bytes, or 16 sectors,
but the disk might write the first and last sectors first, and 
then the middle sectors in random order.  A block of all zeroes
might not be written at all, but just noted in the track metadata.

Most disks have a "feature" that they report the write complete
as soon as it is in the RAM cache, rather than after the sectors
are on the disk.  (It's a "feature" because it makes their
benchmarks come out better.)  It can usually be turned off, but 
different vendors have different ways to do it.  Have you turned
it off on your production drives?

In the event of a power outage, the drive will stop writing in
mid-sector.  If you're lucky, that sector would have a bad checksum
if you tried to read it.  If the half-written sector happens to 
contain track metadata, you might have a bigger problem.  

----
The short summary is: for power outage or OS-crash recovery purposes,
there is no such thing as atomicity.  This is why backups and 
transaction logs are important.

"Invest in a UPS."  Use a reliable OS, and operate it in a way that
doesn't stress it.  Even a well-built OS will behave oddly when 
resources are badly stressed.  (That the oddities may be documented
doesn't really help much.)

For performance purposes, it may be more or less efficient to group 
writes into 4K, 8K, or 32K chunks.  That's not a matter of database 
atomicity, but of I/O optimization.  It can only confuse people to 
use "atomicity" in that context.

Nathan Myers
ncm@zembu.com



Re: 8192 BLCKSZ ?

From
Tom Lane
Date:
Nathan Myers <ncm@zembu.com> writes:
> In the event of a power outage, the drive will stop writing in
> mid-sector.

Really?  Any competent drive firmware designer would've made sure that
can't happen.  The drive has to detect power loss well before it
actually loses control of its actuators, because it's got to move
the heads to the safe landing zone.  If it checks for power loss and
starts that shutdown process between sector writes, never in the middle
of one, voila: atomic writes.

Of course, there's still no guarantee if you get a hardware failure
or sector write failure (recovery from the write failure might well
take longer than the drive has got).  But guarding against a plain
power-failure scenario is actually simpler than doing it the wrong
way.

But, as you say, customary page sizes are bigger than a sector, so
this is all moot for our purposes anyway :-(
        regards, tom lane


Re: 8192 BLCKSZ ?

From
Nathan Myers
Date:
On Tue, Nov 28, 2000 at 04:24:34PM -0500, Tom Lane wrote:
> Nathan Myers <ncm@zembu.com> writes:
> > In the event of a power outage, the drive will stop writing in
> > mid-sector.
> 
> Really?  Any competent drive firmware designer would've made sure that
> can't happen.  The drive has to detect power loss well before it
> actually loses control of its actuators, because it's got to move
> the heads to the safe landing zone.  If it checks for power loss and
> starts that shutdown process between sector writes, never in the middle
> of one, voila: atomic writes.

I used to think that way too, because that's how I would design a drive.
(Anyway that would still only give you 512-byte-atomic writes, which 
isn't enough.)

Talking to people who build them was a rude awakening.  They have
apparatus to yank the head off the drive and lock it away when the 
power starts to go down, and it will happily operate in mid-write.
(It's possible that some drives are made the way Tom describes, but 
evidently not the commodity stuff.)

The level of software-development competence, and of reliability 
engineering, that I've seen among disk drive firmware maintainers
distresses me whenever I think about it.  A disk drive is best
considered as a throwaway cache image of your real medium.

> Of course, there's still no guarantee if you get a hardware failure
> or sector write failure (recovery from the write failure might well
> take longer than the drive has got).  But guarding against a plain
> power-failure scenario is actually simpler than doing it the wrong
> way.

If only the disk-drive vendors (and buyers!) thought that way...

Nathan Myers
ncm@zembu.com



Re: 8192 BLCKSZ ?

From
Matthew Kirkwood
Date:
On Tue, 28 Nov 2000, Tom Lane wrote:

> Nathan Myers <ncm@zembu.com> writes:
> > In the event of a power outage, the drive will stop writing in
> > mid-sector.
> 
> Really?  Any competent drive firmware designer would've made sure that
> can't happen.  The drive has to detect power loss well before it
> actually loses control of its actuators, because it's got to move the
> heads to the safe landing zone.  If it checks for power loss and
> starts that shutdown process between sector writes, never in the
> middle of one, voila: atomic writes.

In principle, that is correct.  However, the SGI XFS people
have apparently found otherwise -- what can happen is that
the drive itself has enough power to complete a write, but
that the disk/controller buffers lose power and so you end
up writing a (perhaps partial) block of zeroes.

Matthew.



Re: 8192 BLCKSZ ?

From
mlw
Date:
Matthew Kirkwood wrote:
> 
> On Tue, 28 Nov 2000, Tom Lane wrote:
> 
> > Nathan Myers <ncm@zembu.com> writes:
> > > In the event of a power outage, the drive will stop writing in
> > > mid-sector.
> >
> > Really?  Any competent drive firmware designer would've made sure that
> > can't happen.  The drive has to detect power loss well before it
> > actually loses control of its actuators, because it's got to move the
> > heads to the safe landing zone.  If it checks for power loss and
> > starts that shutdown process between sector writes, never in the
> > middle of one, voila: atomic writes.
> 
> In principle, that is correct.  However, the SGI XFS people
> have apparently found otherwise -- what can happen is that
> the drive itself has enough power to complete a write, but
> that the disk/controller buffers lose power and so you end
> up writing a (perhaps partial) block of zeroes.

I have worked on a few systems that are intended to handle a hard power
failure gracefully. It is a very hard thing to do, and it takes a lot of
specialized circuitry.

While it is nice to think about, on a normal computer system one cannot
depend on a graceful shutdown after a hard power loss without a smart UPS
and a daemon to shut the system down.

It does not matter one bit what the disk write sizes are. Unless the
computer knows it is about to lose power, it cannot halt its operations
and enter a safe mode.

The whole "pull the plug" mentality is silly. Unless the system hardware is
specifically designed to manage this and the proper software is in place, it
cannot be done, and any "compliance" you think you see is simply luck.

Any computer that has important data should have a smart UPS and a
daemon to manage it. 

-- 
http://www.mohawksoft.com