Thread: Bumping block size to 16K on FreeBSD...
This is a spillover from some discussions on some of the FreeBSD mailing lists about FS performance. After FreeBSD 4.5-RELEASE, the file system block size was bumped from 8K to 16K. Right now, PostgreSQL still stores data in 8K blocks. Are there any objections to me increasing the block size for FreeBSD installations to 16K for the upcoming 7.4 release? 'bout the only reason I can think of to _not_ increase the block size for FreeBSD would be if someone was mounting PGDATA on an ext2 partition. :-P

Early performance tests on my laptop suggest it's about 8% faster for writing when both the FS and PostgreSQL use 16K blocks. From my tests loading a database:

With 8K blocks:

    15.188u 3.404s 7:12.27 4.2% 209+340k 1251+0io 0pf+0w
    14.867u 3.686s 7:32.54 4.0% 201+327k 1252+0io 0pf+0w

    avg wall clock sec to complete: 442

With 16K blocks:

    15.192u 3.312s 6:44.43 4.5% 198+322k 1253+0io 0pf+0w
    15.120u 3.330s 6:51.43 4.4% 205+334k 1254+0io 0pf+0w

    avg wall clock sec to complete: 407

I'll take the 35sec/8% speedup any day of the week and twice on Sunday. Granted, these tests were done on my laptop and were 100% write. If someone wants to do some good read tests, I'd be interested in those results to see if it's still 8% faster. Using 16K blocks should also make seq scans cheaper on FreeBSD.

Comments?  -sc

-- 
Sean Chittenden
Sean Chittenden <sean@chittenden.org> writes:
> Are there any objections to me increasing the block size for FreeBSD
> installations to 16K for the upcoming 7.4 release?

I'm a little uncomfortable with introducing a cross-platform variation in the standard block size. That would have implications for things like whether a table definition that works on FreeBSD could be expected to work elsewhere; to say nothing of recommendations for shared_buffer settings and suchlike.

Also, there is no infrastructure for adjusting BLCKSZ automatically at configure time, and I don't much want to add it.

			regards, tom lane
Sean Chittenden <sean@chittenden.org> writes:
>> You do realize that you'll be forcing initdbs on people if you
>> blithely add and remove such a patch?

> Yup. I was tempted to include another patch just to bump the catalog
> version in the event that it doesn't gracefully detect differing block
> sizes... haven't tested whether it does that or not.

It does, and you will definitely incur my wrath if you start putting in unnecessary platform-specific catversion hacks.

>> I think we forced an initdb for beta2 for other reasons, but I'm
>> hoping to avoid any more in the 7.4 branch.

> 1) People shouldn't be using -devel for production

No, but they might reasonably hope to segue a beta installation into production without yet-another reload ...

			regards, tom lane
Thomas Swan <tswan@idigx.com> writes:
> Tom Lane wrote:
>> I'm a little uncomfortable with introducing a cross-platform variation
>> in the standard block size.

> Has anyone looked at changing the default block size across the board
> and what the performance improvements/penalties might be? Hardware has
> changed quite a bit over the years.

Not that I know of. That might actually be a more reasonable proposal than changing it only on one platform. It would take a fair amount of legwork to generate enough evidence to convince people, though ...

			regards, tom lane
> > Are there any objections to me increasing the block size for
> > FreeBSD installations to 16K for the upcoming 7.4 release?
>
> I'm a little uncomfortable with introducing a cross-platform
> variation in the standard block size. That would have implications
> for things like whether a table definition that works on FreeBSD
> could be expected to work elsewhere; to say nothing of
> recommendations for shared_buffer settings and suchlike.

Hrm, well, given things just went to beta2, I'm going to bump postgresql-devel to beta2 and include this patch for now. However, I'm going to explicitly request that people who have problems or successes with beta2 on FreeBSD ask me before reporting problems that may stem from a platform-specific alteration.

That said, an 8% speedup on writes is non-trivial and something I'd like to pick up if possible. :) I have faith that PG does the right thing, as I'm sure other people have done this without incident in the past; I just don't think anyone's tuned this for the platform and all of its users.

> Also, there is no infrastructure for adjusting BLCKSZ automatically
> at configure time, and I don't much want to add it.

The patch gets applied when the port gets built, so there doesn't need to be a configure option for this to work.

-- 
Sean Chittenden
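[For readers who have not seen the port in question: the patch presumably amounts to little more than changing PostgreSQL's compile-time block size. A minimal sketch of the idea -- the exact header is an assumption here, since the define has lived in src/include/pg_config_manual.h in some releases and elsewhere in others:

    /*
     * Hypothetical port-level patch: raise PostgreSQL's block size to match
     * FreeBSD UFS's 16K filesystem blocks.  BLCKSZ must remain a power of 2,
     * and (in this era of the code) no larger than 32K.
     */
    #define BLCKSZ 16384        /* default is 8192 */

A cluster initdb'd with a different BLCKSZ cannot be read by a server built this way, which is the initdb-forcing problem discussed below.]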
Tom Lane wrote:
> Sean Chittenden <sean@chittenden.org> writes:
>> Are there any objections to me increasing the block size for FreeBSD
>> installations to 16K for the upcoming 7.4 release?
>
> I'm a little uncomfortable with introducing a cross-platform variation
> in the standard block size. That would have implications for things
> like whether a table definition that works on FreeBSD could be expected
> to work elsewhere; to say nothing of recommendations for shared_buffer
> settings and suchlike.
>
> Also, there is no infrastructure for adjusting BLCKSZ automatically at
> configure time, and I don't much want to add it.

Has anyone looked at changing the default block size across the board and what the performance improvements/penalties might be? Hardware has changed quite a bit over the years.
Sean Chittenden <sean@chittenden.org> writes:
> Hrm, well, given things just went to beta2, I'm going to bump
> postgresql-devel to beta2 and include this patch for now. However, I'm
> going to explicitly request that people who have problems or successes
> with beta2 on FreeBSD ask me before reporting problems that may stem
> from a platform-specific alteration.

You do realize that you'll be forcing initdbs on people if you blithely add and remove such a patch?

I think we forced an initdb for beta2 for other reasons, but I'm hoping to avoid any more in the 7.4 branch.

			regards, tom lane
> > Hrm, well, given things just went to beta2, I'm going to bump
> > postgresql-devel to beta2 and include this patch for now. However,
> > I'm going to explicitly request that people who have problems or
> > successes with beta2 on FreeBSD ask me before reporting problems
> > that may stem from a platform-specific alteration.
>
> You do realize that you'll be forcing initdbs on people if you
> blithely add and remove such a patch?

Yup. I was tempted to include another patch just to bump the catalog version in the event that it doesn't gracefully detect differing block sizes... haven't tested whether it does that or not.

> I think we forced an initdb for beta2 for other reasons, but I'm
> hoping to avoid any more in the 7.4 branch.

1) People shouldn't be using -devel for production.
2) Reloading is still a PITA, which is why I asked for comments.

Other than you feeling uneasy about the possibility of uncovering bugs because this hasn't been widely done before, do you have any other concerns, or do you think the possibility of finding bugs is very likely?

-sc

-- 
Sean Chittenden
On Thu, 28 Aug 2003, Thomas Swan wrote:

> Has anyone looked at changing the default block size across the board
> and what the performance improvements/penalties might be? Hardware has
> changed quite a bit over the years.

I *think* that the reason for the performance improvement on FreeBSD is that our FS block size is 16k, instead of 8k ... are there any other OSs that have increased theirs?
On Thu, Aug 28, 2003 at 03:56:18PM -0400, Tom Lane wrote:
> Sean Chittenden <sean@chittenden.org> writes:
> > Hrm, well, given things just went to beta2, I'm going to bump
> > postgresql-devel to beta2 and include this patch for now. However, I'm
> > going to explicitly request that people who have problems or successes
> > with beta2 on FreeBSD ask me before reporting problems that may stem
> > from a platform-specific alteration.
>
> You do realize that you'll be forcing initdbs on people if you blithely
> add and remove such a patch?
>
> I think we forced an initdb for beta2 for other reasons, but I'm hoping
> to avoid any more in the 7.4 branch.

If it does in fact improve performance on FreeBSD (and possibly others), at least make it a build-time option.

-- 
[ Jim Mercer      jim@reptiles.org      +1 416 410-5633 ]
[ I want to live forever, or die trying.                ]
Tom Lane wrote:
> Thomas Swan <tswan@idigx.com> writes:
>> Tom Lane wrote:
>>> I'm a little uncomfortable with introducing a cross-platform variation
>>> in the standard block size.
>>
>> Has anyone looked at changing the default block size across the board
>> and what the performance improvements/penalties might be? Hardware has
>> changed quite a bit over the years.
>
> Not that I know of. That might actually be a more reasonable proposal
> than changing it only on one platform. It would take a fair amount
> of legwork to generate enough evidence to convince people, though ...

I know that you can specify different block sizes for different fs/OS combinations; notably, there were discussions before about running the WAL on fat16/32 disks with different performance characteristics.

Also, it's not just an OS abstraction; hardware has changed and evolved in such a way that the physical disks are reading and writing in larger chunks. To me it would seem wasteful not to use bandwidth that is available for little or no extra cost.

Perhaps testing 8K, 16K, 32K, and 64K block sizes would be a worthwhile venture. I will have time this weekend with the holiday to work on some benchmarking for these sizes, if only on a Linux system. Tom, what would you consider acceptable for a preliminary investigation? What should I look at: runtime, disk space required before and after, fsync (on/off)?

-- 
Thomas
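[As a starting point for the kind of legwork Thomas describes, here is a rough, self-contained sketch written against plain POSIX I/O. The file name and sizes are made up, and it only exercises the OS/filesystem write path, not PostgreSQL itself, so treat it as a sanity check rather than evidence:

    /* blktest.c: time sequential writes of the same total volume using
     * 8K, 16K, 32K and 64K write() sizes.  Filesystem side only; it says
     * nothing about WAL overhead or random reads. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(void)
    {
        const size_t sizes[] = { 8192, 16384, 32768, 65536 };
        const size_t total = 256 * 1024 * 1024;   /* 256 MB per pass */
        char   *buf = malloc(65536);
        int     i;

        memset(buf, 'x', 65536);
        for (i = 0; i < 4; i++)
        {
            struct timeval t0, t1;
            size_t  done;
            int     fd = open("blktest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

            gettimeofday(&t0, NULL);
            for (done = 0; done < total; done += sizes[i])
                if (write(fd, buf, sizes[i]) < 0)
                    perror("write");
            fsync(fd);
            gettimeofday(&t1, NULL);
            close(fd);
            printf("%6lu-byte writes: %.2f s\n", (unsigned long) sizes[i],
                   (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
        }
        unlink("blktest.dat");
        free(buf);
        return 0;
    }

Compile with "cc -O2 -o blktest blktest.c" and run it once per newfs'd block-size/fragment-size combination; the interesting part is how the numbers move when the filesystem block size crosses the write size.]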
On Thu, 28 Aug 2003, Marc G. Fournier wrote:

> On Thu, 28 Aug 2003, Thomas Swan wrote:
>
> > Has anyone looked at changing the default block size across the board
> > and what the performance improvements/penalties might be? Hardware has
> > changed quite a bit over the years.
>
> I *think* that the reason for the performance improvement on FreeBSD is
> that our FS block size is 16k, instead of 8k ... are there any other
> OSs that have increased theirs?

Linux is still, as far as I know, limited to the page size of the CPU it's on, which for most x86 is 4k. Windows 2k can go up to 64k block sizes.
Sean Chittenden <sean@chittenden.org> writes:
> Early performance tests on my laptop suggest it's about 8% faster for
> writing when both the FS and PostgreSQL use 16K blocks.

BTW, I don't really believe that one set of tests, conducted on one single machine, is anywhere near enough justification for changing this value. Especially not if it's a laptop rather than a typical server configuration. You've got considerably less I/O bandwidth in proportion to CPU horsepower than a server.

Why is that an issue? Well, a larger block size will substantially increase our WAL overhead (because we tend to dump whole blocks into WAL at the slightest provocation), and on slower machines the CRC64 calculations involved in WAL entries are a significant cost. On a machine with less CPU and more disk horsepower than you tested, the tradeoffs could be a lot different.

			regards, tom lane
Sean, can we get a copy of your test set? And any scripts that you have for running the tests?

On Thu, 28 Aug 2003, Tom Lane wrote:

> Sean Chittenden <sean@chittenden.org> writes:
> > Early performance tests on my laptop suggest it's about 8% faster for
> > writing when both the FS and PostgreSQL use 16K blocks.
>
> BTW, I don't really believe that one set of tests, conducted on one
> single machine, is anywhere near enough justification for changing this
> value. Especially not if it's a laptop rather than a typical server
> configuration. You've got considerably less I/O bandwidth in proportion
> to CPU horsepower than a server. Why is that an issue? Well, a larger
> block size will substantially increase our WAL overhead (because we tend
> to dump whole blocks into WAL at the slightest provocation), and on
> slower machines the CRC64 calculations involved in WAL entries are a
> significant cost. On a machine with less CPU and more disk horsepower
> than you tested, the tradeoffs could be a lot different.
>
> 			regards, tom lane
On Thu, Aug 28, 2003, Tom Lane wrote:
> Sean Chittenden <sean@chittenden.org> writes:
> > Are there any objections to me increasing the block size for FreeBSD
> > installations to 16K for the upcoming 7.4 release?
>
> I'm a little uncomfortable with introducing a cross-platform variation
> in the standard block size. That would have implications for things
> like whether a table definition that works on FreeBSD could be expected
> to work elsewhere; to say nothing of recommendations for shared_buffer
> settings and suchlike.
>
> Also, there is no infrastructure for adjusting BLCKSZ automatically at
> configure time, and I don't much want to add it.

On recent versions of FreeBSD (and Solaris too, I think), the default UFS block size is 16K, and file fragments are 2K. This works great for many workloads, but it kills pgsql's random write performance unless pgsql uses 16K blocks as well, due to the read-modify-write involved. Either the filesystem or the database needs to be changed in order to get decent performance. I have not compared 16K UFS/16K pgsql to 8K UFS/8K pgsql, so I can't say which option makes more sense, though.

There probably isn't anything wrong with the pgsql default, except that it's set in stone. It's entirely feasible for administrators to create 8K/1K UFS filesystems specifically for pgsql, but they need to be aware of the issue. On the other hand, I don't see how it would be a bad thing if pgsql were able to adapt at runtime either. Thus, I've come up with two possible fixes:

(1) Document the problem with having a filesystem block size larger than the database block size. With a simple call to statvfs(2), the postmaster could warn about this on startup, too.

(2) Make BLCKSZ a runtime constant, stored as part of the database. Grepping through the source, I didn't see any places using BLCKSZ where efficiency appeared to be so critical that you had to have constant folding. Of course, one could introduce a 'lg2blksz' constant to avoid divides and multiplies. This would NOT introduce cross-platform incompatibilities, only efficiency problems with databases that have been moved across filesystems in some cases. The ability to adapt at database creation time is also useful in that it allows the database to be tuned to the characteristics of the particular device on which it resides. [1]

I don't know very much about pgsql, so corrections and feedback regarding these ideas would be appreciated.

[1] Right now, the seek time to transfer time ratio of the drive is mostly hidden by the operating system's clustering and read-ahead. I tried modifying pgsql to use direct I/O, but it seems that pgsql doesn't do its own clustering or read-ahead, so that was a lose...
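[A hedged sketch of suggestion (1): check the data directory's filesystem at startup and complain if its block size exceeds BLCKSZ. The function name and the place it would be called from are invented for illustration; only statvfs(2) itself is real:

    #include <stdio.h>
    #include <sys/statvfs.h>

    #ifndef BLCKSZ
    #define BLCKSZ 8192             /* PostgreSQL's compile-time block size */
    #endif

    /* Hypothetical postmaster startup check, per suggestion (1) above. */
    static void
    check_fs_blocksize(const char *datadir)
    {
        struct statvfs vfs;

        if (statvfs(datadir, &vfs) != 0)
            return;                 /* can't tell; stay quiet */

        /* f_bsize is the filesystem block size, f_frsize the fragment size,
         * at least on the systems I've looked at */
        if (vfs.f_bsize > (unsigned long) BLCKSZ)
            fprintf(stderr,
                    "NOTICE: filesystem block size (%lu) is larger than BLCKSZ (%d);\n"
                    "        random writes may turn into read-modify-write cycles\n",
                    (unsigned long) vfs.f_bsize, BLCKSZ);
    }

Calling something like check_fs_blocksize($PGDATA) once at startup would cost essentially nothing and would at least surface the mismatch.]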
On Thu, Aug 28, 2003, scott.marlowe wrote:
> On Thu, 28 Aug 2003, Marc G. Fournier wrote:
> > On Thu, 28 Aug 2003, Thomas Swan wrote:
> >
> > > Has anyone looked at changing the default block size across the board
> > > and what the performance improvements/penalties might be? Hardware has
> > > changed quite a bit over the years.
> >
> > I *think* that the reason for the performance improvement on FreeBSD is
> > that our FS block size is 16k, instead of 8k ... are there any other
> > OSs that have increased theirs?
>
> Linux is still, as far as I know, limited to the page size of the CPU
> it's on, which for most x86 is 4k.

I don't know about the page size issue, but Linux has the additional problem that ext2/ext3 do not support fragments or variable block sizes within the same filesystem. Therefore, Linux wastes an excessive amount of space for larger block sizes.
> > > Early performance tests on my laptop suggest it's about 8%
> > > faster for writing when both the FS and PostgreSQL use 16K
> > > blocks.
> >
> > BTW, I don't really believe that one set of tests, conducted on
> > one single machine, is anywhere near enough justification for
> > changing this value. Especially not if it's a laptop rather than
> > a typical server configuration. You've got considerably less I/O
> > bandwidth in proportion to CPU horsepower than a server. Why is
> > that an issue? Well, a larger block size will substantially
> > increase our WAL overhead (because we tend to dump whole blocks
> > into WAL at the slightest provocation), and on slower machines the
> > CRC64 calculations involved in WAL entries are a significant cost.
> > On a machine with less CPU and more disk horsepower than you
> > tested, the tradeoffs could be a lot different.
>
> Sean, can we get a copy of your test set? And any scripts that you
> have for running the tests?

Unfortunately not; my tests were simply re-initdb'ing and loading in my schema. I have some read tests I'm going to perform here in a bit, but I'm waiting for KDE to finish compiling before I start testing.

I have another test machine that I'm going to task with comparing 16K and 8K blocks. It's not SCSI, but I don't have any available machines that I can newfs + reinstall PostgreSQL on. I was thinking about running the regression tests 10x ...

-sc

-- 
Sean Chittenden
On Thu, Aug 28, 2003 at 01:00:44PM -0700, Sean Chittenden wrote:
> Other than you feeling uneasy about the possibility of uncovering bugs
> because this hasn't been widely done before, do you have any other
> concerns, or do you think the possibility of finding bugs is very
> likely?

In case Tom didn't make this clear, I'm strongly opposed to making this change without doing the necessary (non-FreeBSD-specific) legwork. The bottom line is that if we're going to be changing the block size on a regular basis, it needs to be completely transparent to the user, from a functionality perspective. That's currently not the case: changing the BLCKSZ changes the meaning of shared_buffers and effective_cache_size, for example, so tuning documents written for other operating systems won't apply as easily to PostgreSQL on FreeBSD. Until the user-visible effects of BLCKSZ have been ironed out[1], I definitely think you shouldn't include the patch in the FreeBSD port.

[1] Other improvements, like making it easier to change the block size (making it a configure option?), would be cool too.

-Neil
On Thu, 28 Aug 2003, Neil Conway wrote:

> On Thu, Aug 28, 2003 at 01:00:44PM -0700, Sean Chittenden wrote:
> > Other than you feeling uneasy about the possibility of uncovering bugs
> > because this hasn't been widely done before, do you have any other
> > concerns, or do you think the possibility of finding bugs is very
> > likely?
>
> In case Tom didn't make this clear, I'm strongly opposed to making
> this change without doing the necessary (non-FreeBSD-specific) legwork.
> The bottom line is that if we're going to be changing the block size
> on a regular basis, it needs to be completely transparent to the user,
> from a functionality perspective. That's currently not the case:
> changing the BLCKSZ changes the meaning of shared_buffers and
> effective_cache_size, for example, so tuning documents written for
> other operating systems won't apply as easily to PostgreSQL on
> FreeBSD. Until the user-visible effects of BLCKSZ have been ironed
> out[1], I definitely think you shouldn't include the patch in the
> FreeBSD port.

"tuning documents" is *not* a valid reason for not doing this ... that's like saying "we can make it faster on some operating systems, but because we're going to have to modify the tuning documents, we're not going to do it" ... wait, that is exactly what you are saying ...

Now, Tom made one point in his original that *was* valid ... a table definition made under a 16k BLCKSZ db will not necessarily work under an 8k compiled server ... the example that he made to me was that a table of float8 under a 16k server could have N fields, but if you tried to dump/import that table into an 8k BLCKSZ one with that max # of fields, it would fail ... that is a *serious* concern against doing this ...

Now, here's a question for someone running a non-FreeBSD OS ... if we were to jump the BLCKSZ to 16k, would it cause a degradation in performance, or would it make no difference to them? Would they see an 8% reduction in performance?

The thing is ... there has been presented a strong, valid reason for moving to 16k (at least under FreeBSD) ... and there has been a valid reason for not making it "easily configurable" ... but are there any strong reasons not to just move to 16k across the board?
On Fri, Aug 29, 2003 at 12:06:59AM -0300, Marc G. Fournier wrote:
> "tuning documents" is *not* a valid reason for not doing this ... that's
> like saying "we can make it faster on some operating systems, but because
> we're going to have to modify the tuning documents, we're not going to do
> it" ... wait, that is exactly what you are saying ...

No, it's a perfectly valid reason for not doing this (in the half-baked form presented so far). PostgreSQL is at the moment fairly simple to configure. Adding a significant amount of complexity to the configuration / tuning process, and making a given configuration non-portable between different platforms and different compiles of PostgreSQL, is something I'd like to avoid if possible. And I think it's possible to avoid it; it's just that the original patch makes no attempt to do so.

For example, why does shared_buffers need to be specified in disk pages, anyway? ISTM it could just as easily be specified in bytes, and PostgreSQL could internally round up/down to the nearest multiple of the BLCKSZ that this instance of PostgreSQL happened to be compiled with.

> Now, Tom made one point in his original that *was* valid ... a table
> definition made under a 16k BLCKSZ db will not necessarily work under an
> 8k compiled server ... the example that he made to me was that a table of
> float8 under a 16k server could have N fields, but if you tried to
> dump/import that table into an 8k BLCKSZ one with that max # of fields,
> it would fail ... that is a *serious* concern against doing this ...

Uh, yeah -- I was talking about that as well. I said "it needs to be completely transparent to the user, from a functionality perspective". If changing the BLCKSZ makes things faster or slower, then fine; if it changes the meaning of various random configuration parameters, makes certain schemas work or not work, and makes other changes to postgres functionality, then it's not fine.

-Neil
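[To make the shared_buffers point concrete, a sketch of the rounding Neil describes. The helper name and the floor value are made up; this is not how the actual GUC code is structured:

    #ifndef BLCKSZ
    #define BLCKSZ 8192
    #endif

    /* Convert a byte count from the config file into a buffer count,
     * rounding up to whole blocks so the setting means the same amount
     * of memory regardless of the BLCKSZ this server was built with. */
    static int
    bytes_to_shared_buffers(long bytes)
    {
        long nbuffers = (bytes + BLCKSZ - 1) / BLCKSZ;

        return (nbuffers < 16) ? 16 : (int) nbuffers;
    }

    /* 32 MB of buffer cache: 4096 buffers at 8K blocks, 2048 at 16K */
    /* int shared_buffers = bytes_to_shared_buffers(32L * 1024 * 1024); */

With that in place, a tuning document could simply say "give PostgreSQL 32 MB of buffers" and be correct on any build.]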
On Thu, 28 Aug 2003, David Schultz wrote:

> (2) Make BLCKSZ a runtime constant, stored as part of the database.

Now this I really like. It would make benchmarking 8K vs. 16K block sizes much easier, as well as of course avoiding the "initdb required after rebuilding" problem.

BTW, pretty much every BSD system is going to be using 16K block sizes on large partitions; the cylinder group size and filesystem overhead is way, way too small when using 8K blocks.

cjs
-- 
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org
    Don't you know, in this new Dark Age, we're all light.    --XTC
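[And a sketch of the 'lg2blksz' trick David mentions, for the case where the block size becomes a per-database value read at startup rather than a compile-time constant. The names here are invented for illustration; the point is only that block arithmetic can stay shifts and masks instead of divides:

    #include <assert.h>
    #include <stdint.h>

    static unsigned int lg2blksz;       /* 13 for 8K blocks, 14 for 16K */

    static void
    set_block_size(uint32_t blcksz)
    {
        assert(blcksz >= 1024 && (blcksz & (blcksz - 1)) == 0); /* power of 2 */
        lg2blksz = 0;
        while ((1u << lg2blksz) < blcksz)
            lg2blksz++;
    }

    /* byte offset <-> block number, no divides or multiplies */
    #define BlockFromOffset(off)    ((off) >> lg2blksz)
    #define OffsetFromBlock(blk)    ((uint64_t) (blk) << lg2blksz)

set_block_size() would be called once per backend after reading the value stored with the database, so the per-access cost is just a variable shift.]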
I am not 100% sure that 16K is the best block size. For instance, using FreeBSD 5.1 I got the best read and write performance with a filesystem block size of 32K and 4K fragments [reading and writing 8K blocks, ufs1 and ufs2 filesystems]. I don't have the results in front of me, but I think I tried FS block sizes from 4K upwards...

I am also not convinced that using 16K in Pg will be better than 8K (you would expect sequential performance to improve, but maybe at the expense of random ...).

regards

Mark
On Fri, Aug 29, 2003 at 12:06:59AM -0300, Marc G. Fournier wrote:
> The thing is ... there has been presented a strong, valid reason for
> moving to 16k (at least under FreeBSD) ... and there has been a valid

It sounds to me, actually, like there is a strong reason for telling people running FreeBSD, "Hey, you can get this big speedup at the possible expense of compatibility by compiling with changes XYZ." But quietly putting that into packages for distribution strikes me as the sort of support headache that one really doesn't want.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<andrew@libertyrms.info>                M2P 2A8
                                        +1 416 646 3304 x110
Marc G. Fournier wrote:
> Now, here's a question for someone running a non-FreeBSD OS ... if we
> were to jump the BLCKSZ to 16k, would it cause a degradation in
> performance, or would it make no difference to them? Would they see an
> 8% reduction in performance?
>
> The thing is ... there has been presented a strong, valid reason for
> moving to 16k (at least under FreeBSD) ... and there has been a valid
> reason for not making it "easily configurable" ... but are there any
> strong reasons not to just move to 16k across the board?

First, I assume all this discussion about the default block size is for 7.5, not for 7.4, which is in beta.

Second, the tests were done only for _write_ performance. We can expect random read performance to be worse for larger block sizes, so I think more research needs to be done. Also, Tatsuo reported years ago that he got ~15% performance improvement with a 32k PostgreSQL block size. The OS was AIX or Linux.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073