Thread: 8192 BLCKSZ ?
This is just a curiosity.

Why is the default postgres block size 8192? These days, with caching file systems, high-speed DMA disks, and hundreds of megabytes (maybe even gigabytes) of RAM, 8K surely seems inefficient. Has anyone done any tests to see whether a default 32K block would provide better overall performance? 8K seems so small, and 32K looks to be where most x86 operating systems have a sweet spot.

If someone has the answer off the top of their head and I'm just being stupid, let me have it. However, I have needed to raise the block size to 32K for a text management system and have seen no performance problems. (It has not been a scientific experiment, admittedly.)

This isn't a rant, but my gut tells me that a 32K block size would be a better default, and that smaller deployments should adjust down as needed.
I've been using a 32k BLCKSZ for months now without any trouble, though I've not benchmarked it to see if it's any faster than one with a BLCKSZ of 8k.

-Mitch

> Why is the default postgres block size 8192? These days, with caching
> file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
> even gigabytes. Surely, 8K is inefficient.
I don't believe it's a performance issue; I believe it's that writes to blocks greater than 8k cannot be guaranteed 'atomic' by the operating system. Hence, 32k blocks would break the transactions system. (Or something like that - am I correct?)

Chris

> -----Original Message-----
> From: Mitch Vincent
> Sent: Tuesday, November 28, 2000 8:40 AM
> Subject: Re: [HACKERS] 8192 BLCKSZ ?
>
> I've been using a 32k BLCKSZ for months now without any trouble, though
> I've not benchmarked it to see if it's any faster than one with a
> BLCKSZ of 8k.
If it breaks anything in PostgreSQL I sure haven't seen any evidence -- the box this database is running on gets hit pretty hard and I haven't had a single ounce of trouble since I went to 7.0.X

-Mitch

----- Original Message -----
From: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>

> I don't believe it's a performance issue, I believe it's that writes to
> blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> system. Hence, 32k blocks would break the transactions system. (Or
> something like that - am I correct?)
> If it breaks anything in PostgreSQL I sure haven't seen any evidence --
> the box this database is running on gets hit pretty hard and I haven't
> had a single ounce of trouble since I went to 7.0.X

Larger block sizes mean larger blocks in the cache, and therefore fewer blocks per megabyte. The more granular the cache, the better.

8k is the standard Unix file system disk transfer size. Go smaller than that and you incur the overhead of the kernel transferring more data than we actually use; go larger and the cache becomes less granular. There are no transaction issues, because we use fsync.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
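To put the granularity argument in concrete terms, here is a rough sketch; the 4 MB cache budget is an arbitrary illustration, not a PostgreSQL default or anything taken from this thread:

```c
/* Sketch of the cache-granularity arithmetic: for a fixed amount of
 * buffer memory, a larger BLCKSZ means fewer, coarser cache slots.
 * The 4 MB budget is purely illustrative. */
#include <stdio.h>

int main(void)
{
    const long cache_bytes   = 4L * 1024 * 1024;
    const int  block_sizes[] = { 4096, 8192, 16384, 32768 };
    const int  n = sizeof block_sizes / sizeof block_sizes[0];

    for (int i = 0; i < n; i++)
        printf("BLCKSZ %5d -> %4ld buffers in a 4 MB cache\n",
               block_sizes[i], cache_bytes / block_sizes[i]);
    return 0;
}
```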
Nothing is guaranteed for anything larger than 512 bytes, and even then you have maybe a 1e-13 likelihood of a badly-written block during a power outage going unnoticed. (That is why the FAQ recommends you invest in a UPS.)

If PG crashes, you're covered, regardless of block size. If the OS crashes, you're not. If the power goes out, you're not.

The block size affects how much is written when you change only a single record within a block. When you update a two-byte field in a 100-byte record, do you want to write 32k? (The answer is "maybe".)

Nathan Myers
ncm@zembu.com

On Tue, Nov 28, 2000 at 09:14:15AM +0800, Christopher Kings-Lynne wrote:
> I don't believe it's a performance issue, I believe it's that writes to
> blocks greater than 8k cannot be guaranteed 'atomic' by the operating
> system. Hence, 32k blocks would break the transactions system. (Or
> something like that - am I correct?)
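As a side note on "a badly-written block going unnoticed": software can at least detect such a block after the fact. The sketch below is purely hypothetical -- PostgreSQL of this era does not checksum its data pages, and the layout and checksum here are invented for illustration:

```c
/* Hypothetical illustration only: a trivial per-block checksum that would
 * let software notice a torn or garbled block after a crash.  Not how
 * PostgreSQL works; the block layout and checksum are made up. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192

/* Simple checksum over everything after the 4-byte checksum slot. */
static uint32_t block_checksum(const unsigned char *block)
{
    uint32_t sum = 0;
    for (size_t i = sizeof(uint32_t); i < BLCKSZ; i++)
        sum = sum * 31 + block[i];
    return sum;
}

/* Stamp the block before writing it out. */
void block_set_checksum(unsigned char *block)
{
    uint32_t sum = block_checksum(block);
    memcpy(block, &sum, sizeof(sum));
}

/* After a crash, a mismatch means the block was only partially written. */
int block_is_intact(const unsigned char *block)
{
    uint32_t stored;
    memcpy(&stored, block, sizeof(stored));
    return stored == block_checksum(block);
}
```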
At 08:39 PM 11/27/00 -0500, Bruce Momjian wrote:
> Larger block sizes mean larger blocks in the cache, therefore fewer
> blocks per megabyte. The more granular the cache, the better.

Well, true, but when you have 256 MB or a half-gig or more to devote to the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is a pain for a lot of folks.

The whole discussion is moot with PG 7.1 and its removal of the tuple-size limit, but it has been unfortunate that the fact that a block size of up to 32KB can easily be configured at build time hasn't been printed in a flaming-red oversized font on the front page of www.postgresql.org. THE ENTIRE WORLD seems to believe that PG suffers from a hard-wired 8KB limit on tuple size, rather than simply defaulting to that limit. When I tell the heathens that the REAL limit is 32KB, they're surprised, amazed, pleased, etc.

This default has unfairly contributed to the poor reputation PG has suffered from for so long, due to widespread ignorance that it's only a default, easily changed. For instance, the November Linux Journal has a favorable column on PG, but it mentions the 8KB limit as though it were absolute. Tim Perdue's article on PHP Builder implied the same when he spoke of PG 7.1 removing the limit.

Again, PG 7.1 removes the issue entirely, but it is ironic that so many people had heard that PG suffered from a hard-wired 8KB limit on tuple length...

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
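For anyone who has only heard of the limit second-hand: the build-time change being described is a one-line edit followed by a rebuild and a fresh initdb. A sketch, with the caveat that the header it lives in (assumed here to be src/include/config.h in 7.0-era sources) may differ in your tree:

```c
/* Sketch of the compile-time constant under discussion.  Assumed location:
 * src/include/config.h in 7.0-era sources -- check your own tree.  The value
 * must be a power of two, and 32768 is the practical upper bound. */
#define BLCKSZ 32768        /* default is 8192 */
```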
> Well, true, but when you have 256 MB or a half-gig or more to devote to
> the cache, you get plenty of blocks, and in pre-7.1 PG the 8KB limit is
> a pain for a lot of folks.

Agreed. The other problem is that most people have 2-4MB of cache, so a 32k default would be too big for them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
At 09:30 PM 11/27/00 -0500, Bruce Momjian wrote:
> Agreed. The other problem is that most people have 2-4MB of cache, so a
> 32k default would be too big for them.

I've always been fine with the default, and in fact agree with it. The OpenACS project recommends a 16KB default for PG 7.0, but that's only so we can hold reasonable-sized lzText strings in forum tables, etc.

I was only lamenting the fact that the world seems to have the impression that it's not a default, but rather a hard-wired limit.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at http://donb.photo.net.
"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes: > I don't believe it's a performance issue, I believe it's that writes to > blocks greater than 8k cannot be guaranteed 'atomic' by the operating > system. Hence, 32k blocks would break the transactions system. As Nathan remarks nearby, it's hard to tell how big a write can be assumed atomic, unless you have considerable knowledge of your OS and hardware. However, on traditional Unix filesystems (BSD-derived) it's a pretty certain bet that writes larger than 8K will *not* be atomic, since 8K is the filesystem block size. You don't even need any crash scenario to see why not: just consider running your disk down to zero free space. If there's one block left when you try to add a multi-block page to your table, you are left with a corrupted page, not an unwritten page. Not sure about the wild-and-wooly world of Linux filesystems... anybody know what the allocation unit is on the popular Linux FSes? My feeling is that 8K is an entirely reasonable size now that we have TOAST, and so there's no longer much interest in changing the default value of BLCKSZ. In theory, I think, WAL should reduce the importance of page writes being atomic --- but it still seems like a good idea to ensure that they are as atomic as we can make them. regards, tom lane
> I've always been fine with the default, and in fact agree with it. The
> OpenACS project recommends a 16KB default for PG 7.0, but that's only so
> we can hold reasonable-sized lzText strings in forum tables, etc.
>
> I was only lamenting the fact that the world seems to have the impression
> that it's not a default, but rather a hard-wired limit.

Agreed.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
On Mon, 27 Nov 2000, mlw wrote:

> Why is the default postgres block size 8192? These days, with caching
> file systems, high speed DMA disks, hundreds of megabytes of RAM, maybe
> even gigabytes. Surely, 8K is inefficient.

I think it is a pretty wild assumption to say that 32k is more efficient than 8k. Considering how blocks are used, 32k may in fact be quite a bit slower than 8k blocks.

Tom
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> Not sure about the wild-and-wooly world of Linux filesystems...
> anybody know what the allocation unit is on the popular Linux FSes?

It rather depends on the filesystem. Current ext2 (the most common) systems default to 1K on small partitions and 4K otherwise. IIRC, reiserfs uses 4K blocks in a tree structure that includes tail merging, which makes the question of block size tricky. Linux 2.3.x passes all file I/O through its page cache, which deals in 4K pages on most 32-bit architectures.

-- 
Bruce Guenter <bruceg@em.ca>                    http://em.ca/~bruceg/
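The page-cache granularity mentioned above can be queried at runtime; a trivial sketch:

```c
/* Sketch: ask the kernel for its VM page size (typically 4096 on x86),
 * the unit the Linux 2.3.x page cache deals in. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    printf("kernel page size: %ld bytes\n", pagesize);
    return 0;
}
```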
On Tue, Nov 28, 2000 at 12:38:37AM -0500, Tom Lane wrote:
> As Nathan remarks nearby, it's hard to tell how big a write can be
> assumed atomic, unless you have considerable knowledge of your OS and
> hardware.

Not to harp on the subject, but even if you _do_ know a great deal about your OS and hardware, you _still_ can't assume any write is atomic.

To give an idea of what is involved, consider that modern disk drives routinely re-order writes, by themselves. You think you have asked for a sequential write of 8K bytes, or 16 sectors, but the disk might write the first and last sectors first, and then the middle sectors in random order. A block of all zeroes might not be written at all, but just noted in the track metadata.

Most disks have a "feature" that they report the write complete as soon as it is in the RAM cache, rather than after the sectors are on the disk. (It's a "feature" because it makes their benchmarks come out better.) It can usually be turned off, but different vendors have different ways to do it. Have you turned it off on your production drives?

In the event of a power outage, the drive will stop writing in mid-sector. If you're lucky, that sector would have a bad checksum if you tried to read it. If the half-written sector happens to contain track metadata, you might have a bigger problem.

----

The short summary is: for power-outage or OS-crash recovery purposes, there is no such thing as atomicity. This is why backups and transaction logs are important.

"Invest in a UPS." Use a reliable OS, and operate it in a way that doesn't stress it. Even a well-built OS will behave oddly when resources are badly stressed. (That the oddities may be documented doesn't really help much.)

For performance purposes, it may be more or less efficient to group writes into 4K, 8K, or 32K chunks. That's not a matter of database atomicity, but of I/O optimization. It can only confuse people to use "atomicity" in that context.

Nathan Myers
ncm@zembu.com
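Tying this back to the earlier "we use fsync" remark: fsync() pushes data out of the OS cache to the drive, but says nothing about the drive's own write cache, which is exactly the behavior described above. A sketch of the OS-level half, for illustration:

```c
/* Sketch: write a block and fsync() it.  This flushes the OS buffer cache
 * to the drive, but says nothing about the drive's internal write cache --
 * which is the point being made above. */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

int write_block_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```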
Nathan Myers <ncm@zembu.com> writes:
> In the event of a power outage, the drive will stop writing in
> mid-sector.

Really? Any competent drive firmware designer would've made sure that can't happen. The drive has to detect power loss well before it actually loses control of its actuators, because it's got to move the heads to the safe landing zone. If it checks for power loss and starts that shutdown process between sector writes, never in the middle of one, voila: atomic writes.

Of course, there's still no guarantee if you get a hardware failure or sector write failure (recovery from the write failure might well take longer than the drive has got). But guarding against a plain power-failure scenario is actually simpler than doing it the wrong way.

But, as you say, customary page sizes are bigger than a sector, so this is all moot for our purposes anyway :-(

			regards, tom lane
On Tue, Nov 28, 2000 at 04:24:34PM -0500, Tom Lane wrote:
> Really? Any competent drive firmware designer would've made sure that
> can't happen. The drive has to detect power loss well before it
> actually loses control of its actuators, because it's got to move
> the heads to the safe landing zone. If it checks for power loss and
> starts that shutdown process between sector writes, never in the middle
> of one, voila: atomic writes.

I used to think that way too, because that's how I would design a drive. (Anyway, that would still only give you 512-byte-atomic writes, which isn't enough.) Talking to people who build them was a rude awakening. They have apparatus to yank the head off the drive and lock it away when the power starts to go down, and it will happily operate in mid-write. (It's possible that some drives are made the way Tom describes, but evidently not the commodity stuff.)

The level of software-development competence, and of reliability engineering, that I've seen among disk drive firmware maintainers distresses me whenever I think about it. A disk drive is best considered as a throwaway cache image of your real medium.

> Of course, there's still no guarantee if you get a hardware failure
> or sector write failure (recovery from the write failure might well
> take longer than the drive has got). But guarding against a plain
> power-failure scenario is actually simpler than doing it the wrong
> way.

If only the disk-drive vendors (and buyers!) thought that way...

Nathan Myers
ncm@zembu.com
On Tue, 28 Nov 2000, Tom Lane wrote:

> Really? Any competent drive firmware designer would've made sure that
> can't happen. The drive has to detect power loss well before it
> actually loses control of its actuators, because it's got to move the
> heads to the safe landing zone. If it checks for power loss and
> starts that shutdown process between sector writes, never in the
> middle of one, voila: atomic writes.

In principle, that is correct. However, the SGI XFS people have apparently found otherwise -- what can happen is that the drive itself has enough power to complete a write, but the disk/controller buffers lose power, and so you end up writing a (perhaps partial) block of zeroes.

Matthew.
Matthew Kirkwood wrote:
> In principle, that is correct. However, the SGI XFS people have
> apparently found otherwise -- what can happen is that the drive itself
> has enough power to complete a write, but the disk/controller buffers
> lose power, and so you end up writing a (perhaps partial) block of
> zeroes.

I have worked on a few systems that intend to take a hard power failure gracefully. It is a very hard thing to do, and it takes a lot of specialized circuitry. While it is nice to think about, on a normal computer system one cannot depend on a graceful shutdown after a hard power loss without a smart UPS and a daemon to shut down the system. Disk write sizes do not matter one bit: unless the computer can know it is about to lose power, it cannot halt its operations and enter a safe mode.

The whole "pull the plug" mentality is silly. Unless the system hardware is specifically designed to manage this, with the proper software in place, it cannot be done, and any "compliance" you think you see is simply luck. Any computer that has important data should have a smart UPS and a daemon to manage it.

--
http://www.mohawksoft.com