Thread: 8K recordsize bad on ZFS?

8K recordsize bad on ZFS?

From: Josh Berkus
Jignesh, All:

Most of our Solaris users have been, I think, following Jignesh's advice
from his benchmark tests to set ZFS page size to 8K for the data zpool.
 However, I've discovered that this is sometimes a serious problem for
some hardware.

For example, having the recordsize set to 8K on a Sun 4170 with 8 drives
recently gave me these appalling Bonnie++ results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
db111           24G           260044  33 62110  17           89914  15 1167  25
Latency                        6549ms    4882ms              3395ms    107ms

I know that's hard to read.  What it's saying is:

Seq Writes: 260 MB/s combined
Seq Reads: 89 MB/s combined
Read Latency: 3.3s

Best guess is that this is a result of overloading the array/drives with
commands for all those small blocks; certainly the behavior observed
(stuttering I/O, latency) is in line with that issue.

Anyway, since this is a DW-like workload, we just bumped the recordsize
up to 128K and the performance issues went away ... reads up over 300 MB/s.
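
For reference, the change itself is just a dataset property; a minimal
sketch, with a made-up dataset name:

   zfs set recordsize=128K tank/pgdata
   zfs get recordsize tank/pgdata    # confirm the new setting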

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: 8K recordsize bad on ZFS?

From: Dimitri
Josh,

it would be great if you could explain how you changed the recordsize to
128K - as this size is assigned on file creation and cannot be changed
later, I suppose you made a backup of your data and then did a full
restore.. is that so?

Rgds,
-Dimitri


On 5/8/10, Josh Berkus <josh@agliodbs.com> wrote:
> Jignesh, All:
>
> Most of our Solaris users have been, I think, following Jignesh's advice
> from his benchmark tests to set ZFS page size to 8K for the data zpool.
>  However, I've discovered that this is sometimes a serious problem for
> some hardware.
>
> For example, having the recordsize set to 8K on a Sun 4170 with 8 drives
> recently gave me these appalling Bonnie++ results:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> db111           24G           260044  33 62110  17           89914  15 1167  25
> Latency                        6549ms    4882ms              3395ms    107ms
>
> I know that's hard to read.  What it's saying is:
>
> Seq Writes: 260 MB/s combined
> Seq Reads: 89 MB/s combined
> Read Latency: 3.3s
>
> Best guess is that this is a result of overloading the array/drives with
> commands for all those small blocks; certainly the behavior observed
> (stuttering I/O, latency) is in line with that issue.
>
> Anyway, since this is a DW-like workload, we just bumped the recordsize
> up to 128K and the performance issues went away ... reads up over 300 MB/s.
>
> --
>                                   -- Josh Berkus
>                                      PostgreSQL Experts Inc.
>                                      http://www.pgexperts.com
>

Re: 8K recordsize bad on ZFS?

From: Josh Berkus
On 5/9/10 1:45 AM, Dimitri wrote:
> Josh,
>
> it would be great if you could explain how you changed the recordsize to
> 128K - as this size is assigned on file creation and cannot be changed
> later, I suppose you made a backup of your data and then did a full
> restore.. is that so?

You can change the recordsize of the zpool dynamically, then simply copy
the data directory (with PostgreSQL shut down) to a new directory on
that zpool.  This assumes that you have enough space on the zpool, of
course.

We didn't test how it would work to let the files in the Postgres
instance get gradually replaced by "natural" updating.
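
Roughly, the steps were of this shape (the paths and dataset names here
are illustrative, not the client's actual layout):

   pg_ctl -D /tank/pgdata stop
   zfs set recordsize=128K tank
   cp -rp /tank/pgdata /tank/pgdata.128k
   # repoint PostgreSQL at the new directory (or swap the names), then
   pg_ctl -D /tank/pgdata.128k start

Since ZFS applies the recordsize as blocks are written, the copied files
pick up the new 128K records.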

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: 8K recordsize bad on ZFS?

From: Ivan Voras
On 05/10/10 20:39, Josh Berkus wrote:
> On 5/9/10 1:45 AM, Dimitri wrote:
>> Josh,
>>
>> it would be great if you could explain how you changed the recordsize to
>> 128K - as this size is assigned on file creation and cannot be changed
>> later, I suppose you made a backup of your data and then did a full
>> restore.. is that so?
>
> You can change the recordsize of the zpool dynamically, then simply copy
> the data directory (with PostgreSQL shut down) to a new directory on
> that zpool.  This assumes that you have enough space on the zpool, of
> course.

Other things could have influenced your result - 260 MB/s vs 300 MB/s is
close enough to be influenced by data position on (some of) the drives.
(I'm not saying anything about the original question.)

Re: 8K recordsize bad on ZFS?

From: Dimitri
As I said, the recordsize is applied at file creation :-)
so by copying your data from one directory to another you've applied
the new recordsize to the newly created files :-)  (equivalent to a
backup/restore if there had not been enough space)..

Did you try to redo the same thing but keep the recordsize at 8K ? ;-)

I think the problem you observed is simply related to the
copy-on-write nature of ZFS - once you modify the data, the sequential
order of pages degrades over time, and eventually the sequential read
turns into random access.. But once you re-copied your files, the
right order was restored.

BTW, 8K is recommended for OLTP workloads, but for DW you can stay
with 128K without problem.
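
As a side note, if you are laying out new datasets you can bake that
recommendation in at creation time - a sketch with made-up pool and
dataset names:

   zfs create -o recordsize=8K tank/pg_oltp    # OLTP: matches PostgreSQL's 8K pages
   zfs create -o recordsize=128K tank/pg_dw    # DW: favors large sequential reads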

Rgds,
-Dimitri


On 5/10/10, Josh Berkus <josh@agliodbs.com> wrote:
> On 5/9/10 1:45 AM, Dimitri wrote:
>> Josh,
>>
>> it would be great if you could explain how you changed the recordsize to
>> 128K - as this size is assigned on file creation and cannot be changed
>> later, I suppose you made a backup of your data and then did a full
>> restore.. is that so?
>
> You can change the recordsize of the zpool dynamically, then simply copy
> the data directory (with PostgreSQL shut down) to a new directory on
> that zpool.  This assumes that you have enough space on the zpool, of
> course.
>
> We didn't test how it would work to let the files in the Postgres
> instance get gradually replaced by "natural" updating.
>
> --
>                                   -- Josh Berkus
>                                      PostgreSQL Experts Inc.
>                                      http://www.pgexperts.com
>

Re: 8K recordsize bad on ZFS?

From: Josh Berkus
Ivan,

> Other things could have influenced your result - 260 MB/s vs 300 MB/s is
> close enough to be influenced by data position on (some of) the drives.
> (I'm not saying anything about the original question.)

You misread my post.  It's *87 MB/s* vs. 300 MB/s.  I kinda doubt that's
position on the drive.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: 8K recordsize bad on ZFS?

From: Greg Stark
On Mon, May 10, 2010 at 8:30 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Ivan,
>
>> Other things could have influenced your result - 260 MB/s vs 300 MB/s is
>> close enough to be influenced by data position on (some of) the drives.
>> (I'm not saying anything about the original question.)
>
> You misread my post.  It's *87 MB/s* vs. 300 MB/s.  I kinda doubt that's
> position on the drive.

That is still consistent with it being caused by the files being
discontiguous. Copying them moved all the blocks to be contiguous and
sequential on disk, and might have had the same effect even if you had
left the setting at 8kB blocks. You described it as "overloading the
array/drives with commands", which is probably accurate but sounds less
exotic if you say "the files were fragmented, causing lots of seeks, so
we saturated the drives' iops capacity". How many iops were you doing
before and after anyways?
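
(On Solaris, running something along the lines of

   iostat -xn 5

alongside the bonnie++ run would show per-device reads and writes per
second, if you still get a chance to measure it.)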

That said, that doesn't change very much. The point remains that with
8kB blocks ZFS is susceptible to files becoming discontiguous and
sequential i/o performing poorly, whereas with 128kB blocks hopefully
that would happen less. Of course, with 128kB blocks updates become a
whole lot more expensive too.


--
greg

Re: 8K recordsize bad on ZFS?

From: Josh Berkus
> That is still consistent with it being caused by the files being
> discontiguous. Copying them moved all the blocks to be contiguous and
> sequential on disk, and might have had the same effect even if you had
> left the setting at 8kB blocks. You described it as "overloading the
> array/drives with commands", which is probably accurate but sounds less
> exotic if you say "the files were fragmented, causing lots of seeks, so
> we saturated the drives' iops capacity". How many iops were you doing
> before and after anyways?

Don't know.  This was a client system, and once we got the target
numbers, they stopped wanting me to run tests on it.  :-(

Note that this was a brand-new system, so there wasn't much time for
fragmentation to occur.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com

Re: 8K recordsize bad on ZFS?

From: Jignesh Shah
On Fri, May 7, 2010 at 8:09 PM, Josh Berkus <josh@agliodbs.com> wrote:
> Jignesh, All:
>
> Most of our Solaris users have been, I think, following Jignesh's advice
> from his benchmark tests to set ZFS page size to 8K for the data zpool.
>  However, I've discovered that this is sometimes a serious problem for
> some hardware.
>
> For example, having the recordsize set to 8K on a Sun 4170 with 8 drives
> recently gave me these appalling Bonnie++ results:
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   4     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> db111           24G           260044  33 62110  17           89914  15 1167  25
> Latency                        6549ms    4882ms              3395ms    107ms
>
> I know that's hard to read.  What it's saying is:
>
> Seq Writes: 260 MB/s combined
> Seq Reads: 89 MB/s combined
> Read Latency: 3.3s
>
> Best guess is that this is a result of overloading the array/drives with
> commands for all those small blocks; certainly the behavior observed
> (stuttering I/O, latency) is in line with that issue.
>
> Anyway, since this is a DW-like workload, we just bumped the recordsize
> up to 128K and the performance issues went away ... reads up over 300 MB/s.
>
> --
>                                  -- Josh Berkus
>                                     PostgreSQL Experts Inc.
>                                     http://www.pgexperts.com
>

Hi Josh,

The 8K recommendation is for OLTP applications.. So if you have seen it
recommended somewhere for DSS/DW workloads, then I need to change that.
DW workloads require throughput, and if they use 8K then they are
limited to 8K x max IOPS, which with 8 disks at about 120 IOPS (typical)
per SAS drive is roughly 8MB/sec.. (Prefetching and other read
optimizations can help push it to about 24-30MB/sec with 8K on 12-disk
arrays).. So yes, that advice is typically bad for DSS, and I believe I
generally recommend 128KB for DSS. So if you have seen 8K recommended
for DSS, let me know and, if I still have access to it, I can change
it.  However, for OLTP you generally want more IOPS with low latency,
which is what 8K provides (the smallest blocksize in ZFS).
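
To spell that arithmetic out (using the ~120 random IOPS per drive
assumed above):

   8 KB x 120 IOPS/drive x 8 drives = 7,680 KB/s, i.e. roughly 7.5-8 MB/sec

With 128KB records each of those same IOs carries 16x more data, which
is why the sequential numbers look so different for DW scans.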

Hope this clarifies.

-Jignesh

Re: 8K recordsize bad on ZFS?

From: Josh Berkus
> Sure, but bulk load + random selects are going to *guarantee*
> fragmentation on a COW system (like ZFS, BTRFS, etc.) as the selects
> start to write out all the hint-bit-dirtied blocks in random order...
>
> i.e. it doesn't take long to make an originally nicely contiguous block
> random....

I'm testing with DD and Bonnie++, though, which create their own files.
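
For concreteness, a dd run of that sort looks something like the
following (the path and sizes here are placeholders):

   dd if=/dev/zero of=/pgdata/ddtest bs=128k count=80000   # sequential write
   dd if=/pgdata/ddtest of=/dev/null bs=128k               # sequential read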

For that matter, running an ETL procedure with a newly created database
on both recordsizes was notably (2.5x) faster on the 128K system.

So I don't think fragmentation is the difference.

--
                                  -- Josh Berkus
                                     PostgreSQL Experts Inc.
                                     http://www.pgexperts.com