Thread: 8K recordsize bad on ZFS?
Jignesh, All:

Most of our Solaris users have been, I think, following Jignesh's advice
from his benchmark tests to set the ZFS recordsize to 8K for the data
zpool.  However, I've discovered that this is sometimes a serious problem
for some hardware.

For example, having the recordsize set to 8K on a Sun 4170 with 8 drives
recently gave me these appalling Bonnie++ results:

Version 1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
db111           24G           260044  33  62110 17           89914 15 1167  25
Latency                        6549ms     4882ms              3395ms    107ms

I know that's hard to read.  What it's saying is:

Seq Writes:   260mb/s combined
Seq Reads:    89mb/s combined
Read Latency: 3.3s

Best guess is that this is a result of overloading the array/drives with
commands for all those small blocks; certainly the behavior observed
(stuttering I/O, high latency) is in line with that issue.

Anyway, since this is a DW-like workload, we just bumped the recordsize
up to 128K and the performance issues went away ... reads up over 300mb/s.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
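For readers who want to reproduce this kind of comparison, a minimal sketch
of checking and setting the recordsize and running a comparable Bonnie++
pass might look like the following.  The dataset name tank/pgdata and the
bench directory are placeholders, not the system described above:

    # Inspect the current recordsize on the data zpool's filesystem
    zfs get recordsize tank/pgdata

    # Set it to 8K (OLTP-style) or 128K (DW-style).  Note this only
    # affects files written *after* the change; existing files keep
    # the record size they were created with.
    zfs set recordsize=8K tank/pgdata

    # A Bonnie++ run roughly like the one quoted above: 24G working set,
    # concurrency 4, running as an unprivileged user
    bonnie++ -d /tank/pgdata/bench -s 24g -c 4 -u postgres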
Josh,

it'll be great if you can explain how you changed the recordsize to 128K -
as this size is assigned at file creation and cannot be changed later, I
suppose you made a backup of your data and then did a full restore.. is
that so?

Rgds,
-Dimitri

On 5/8/10, Josh Berkus <josh@agliodbs.com> wrote:
> Anyway, since this is a DW-like workload, we just bumped the recordsize
> up to 128K and the performance issues went away ... reads up over 300mb/s.
On 5/9/10 1:45 AM, Dimitri wrote:
> Josh,
>
> it'll be great if you can explain how you changed the recordsize to
> 128K - as this size is assigned at file creation and cannot be changed
> later, I suppose you made a backup of your data and then did a full
> restore.. is that so?

You can change the recordsize of the zpool dynamically, then simply copy
the data directory (with PostgreSQL shut down) to a new directory on that
zpool.  This assumes that you have enough space on the zpool, of course.

We didn't test how it would work to let the files in the Postgres
instance get gradually replaced by "natural" updating.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
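In shell terms, the procedure Josh describes might look roughly like this;
the dataset and directory names are illustrative, not taken from the thread:

    pg_ctl -D /tank/pgdata/data stop             # PostgreSQL must be down

    zfs set recordsize=128K tank/pgdata          # only affects newly written files

    # Rewriting every file applies the new recordsize to the copies
    cp -rp /tank/pgdata/data     /tank/pgdata/data.new
    mv     /tank/pgdata/data     /tank/pgdata/data.old
    mv     /tank/pgdata/data.new /tank/pgdata/data

    pg_ctl -D /tank/pgdata/data start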
On 05/10/10 20:39, Josh Berkus wrote:
> On 5/9/10 1:45 AM, Dimitri wrote:
>> it'll be great if you can explain how you changed the recordsize to
>> 128K?
>
> You can change the recordsize of the zpool dynamically, then simply copy
> the data directory (with PostgreSQL shut down) to a new directory on
> that zpool.  This assumes that you have enough space on the zpool, of
> course.

Other things could have influenced your result - 260 MB/s vs 300 MB/s is
close enough to be influenced by data position on (some of) the drives.
(I'm not saying anything about the original question.)
As I said, the recordsize is applied at file creation :-) so by copying
your data from one directory to another you've applied the new recordsize
to the newly created files :-) (equivalent to a backup/restore if there
was not enough space)..

Did you try to redo the same thing but still keeping the recordsize at
8K? ;-)

I think the problem you observed is simply related to the copy-on-write
nature of ZFS - once you modify the data, the sequential order of pages
breaks down over time, and the sequential read eventually turns into
random access.. But once you re-copied your files, the right order was
restored.

BTW, 8K is recommended for OLTP workloads, but for DW you can stay with
128K without problem.

Rgds,
-Dimitri

On 5/10/10, Josh Berkus <josh@agliodbs.com> wrote:
> You can change the recordsize of the zpool dynamically, then simply copy
> the data directory (with PostgreSQL shut down) to a new directory on
> that zpool.  This assumes that you have enough space on the zpool, of
> course.
Ivan,

> Other things could have influenced your result - 260 MB/s vs 300 MB/s is
> close enough to be influenced by data position on (some of) the drives.
> (I'm not saying anything about the original question.)

You misread my post.  It's *89mb/s* vs. 300mb/s.  I kinda doubt that's
position on the drive.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
On Mon, May 10, 2010 at 8:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> You misread my post.  It's *89mb/s* vs. 300mb/s.  I kinda doubt that's
> position on the drive.

That is still consistent with it being caused by the files being
discontiguous.  Copying them moved all the blocks to be contiguous and
sequential on disk, and might have had the same effect even if you had
left the setting at 8kB blocks.  You described it as "overloading the
array/drives with commands", which is probably accurate but sounds less
exotic if you say "the files were fragmented, causing lots of seeks, so
we saturated the drives' iops capacity".  How many iops were you doing
before and after, anyway?

That said, that doesn't change very much.  The point remains that with
8kB blocks ZFS is susceptible to files becoming discontiguous and
sequential i/o performing poorly, whereas with 128kB blocks hopefully
that would happen less.  Of course, with 128kB blocks updates become a
whole lot more expensive too.

--
greg
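For anyone wanting to answer Greg's iops question on a similar Solaris box,
the stock tools report per-device and per-vdev operation rates while the
benchmark runs; the pool name tank is a placeholder:

    # Per-device reads/writes per second (r/s, w/s) and service times, every 5s
    iostat -xn 5

    # Per-vdev read/write operations for the pool, every 5s
    zpool iostat -v tank 5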
> You described it as "overloading the array/drives with commands", which
> is probably accurate but sounds less exotic if you say "the files were
> fragmented, causing lots of seeks, so we saturated the drives' iops
> capacity".  How many iops were you doing before and after, anyway?

Don't know.  This was a client system, and once we got the target numbers
they stopped wanting me to run tests on it.  :-(

Note that this was a brand-new system, so there wasn't much time for
fragmentation to occur.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
On Fri, May 7, 2010 at 8:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Most of our Solaris users have been, I think, following Jignesh's advice
> from his benchmark tests to set the ZFS recordsize to 8K for the data
> zpool.  However, I've discovered that this is sometimes a serious problem
> for some hardware.
> ...
> Anyway, since this is a DW-like workload, we just bumped the recordsize
> up to 128K and the performance issues went away ... reads up over 300mb/s.

Hi Josh,

The 8K recommendation is for OLTP applications.  So if you've seen it
recommended somewhere for a DSS/DW workload, then I need to change it.

DW workloads require throughput, and with 8K they are limited to
8K x max IOPS: with 8 SAS drives at about 120 IOPS each (typical), that
is roughly 8MB/sec.  (Prefetching and other optimizations can help push
that to about 24-30MB/sec with 8K on 12-disk arrays.)  So yes, that
advice is typically bad for DSS, and I believe I generally recommend
128KB for DSS.  If you have seen the 8K recommendation applied to DSS
somewhere, let me know, and hopefully, if I still have access to it, I
can change it.

However, for OLTP you generally want more IOPS with low latency, which
is what 8K provides (the smallest blocksize in ZFS).

Hope this clarifies.

-Jignesh
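As a quick sanity check on Jignesh's arithmetic, the throughput ceiling he
describes is just drives x random IOPS per drive x record size.  A
throwaway shell calculation (the ~120 IOPS/drive figure is his typical
estimate, not a measurement):

    # 8 drives * ~120 random IOPS each * 8 KB per record
    echo $(( 8 * 120 * 8 ))     # 7680 KB/s, i.e. roughly 8 MB/sec

    # Same drives with 128 KB records
    echo $(( 8 * 120 * 128 ))   # 122880 KB/s, i.e. roughly 120 MB/sec ceiling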
> Sure, but bulk load + random selects is going to *guarantee*
> fragmentation on a COW system (like ZFS, BTRFS, etc) as the selects
> start to write out all the hint-bit-dirtied blocks in random order...
>
> i.e. it doesn't take long to make an originally nicely contiguous block
> random....

I'm testing with dd and Bonnie++, though, which create their own files.
For that matter, running an ETL procedure with a newly created database
on both recordsizes was notably (2.5x) faster on the 128K system.  So I
don't think fragmentation is the difference.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
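For completeness, the dd-style streaming test Josh alludes to is usually
something along these lines; the file path and size are just examples, and
the k suffix is used because Solaris dd does not take M:

    # Sequential write of a ~24 GB file in 1 MB chunks
    dd if=/dev/zero of=/tank/pgdata/bench/ddtest bs=1024k count=24576

    # Sequential read of the same file back
    dd if=/tank/pgdata/bench/ddtest of=/dev/null bs=1024k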