Thread: Postgresql 9.4 and ZFS?

Postgresql 9.4 and ZFS?

From: Benjamin Smith
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
with ZFS?

We've been running both on ZFS/CentOS 6 with excellent results, and are
considering putting the two together. In particular, the CoW nature (and
subsequent fragmentation/thrashing) of ZFS becomes largely irrelevant on SSDs;
the very act of wear leveling on an SSD is itself a form of intentional
thrashing that doesn't affect performance since SSDs have no meaningful seek
time. It would seem that PGCon 2013 even had a workshop on it!
https://www.pgcon.org/2013/schedule/events/612.en.html

The exact configuration we're contemplating is either (3x 400 RAIDZ1) or (4x
400 RAIDZ2) with Intel Enterprise SATA3 SSDs, with default (lz4) compression
enabled.

If this is a particularly good or bad idea, I'd like to hear it, and why?

Thanks,

BenP


Re: Postgresql 9.4 and ZFS?

From: John R Pierce
On 9/29/2015 10:01 AM, Benjamin Smith wrote:
> Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
> with ZFS?

For databases, I've always used mirrored pools, not raidz*.

put pgdata in its own zfs file system in your zpool.  on that dedicated
zfs, set the blocksize to 8k.



--
john r pierce, recycling bits in santa cruz



Re: Postgresql 9.4 and ZFS?

From: Benjamin Smith
On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
> On 9/29/2015 10:01 AM, Benjamin Smith wrote:
> > Does anybody here have any recommendations for using PostgreSQL 9.4
> > (latest) with ZFS?
>
> For databases, I've always used mirrored pools, not raidz*.


> put pgdata in its own zfs file system in your zpool.  on that dedicated
> zfs, set the blocksize to 8k.

Based on my reading here, that would be -o ashift=13 ?
HowDoesZFSonLinuxHandleAdvacedFormatDrives

EG: 2^13 = 8192




Re: Postgresql 9.4 and ZFS?

From: Larry Rosenman
On Tue, Sep 29, 2015 at 01:08:20PM -0700, Benjamin Smith wrote:
> On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
> > On 9/29/2015 10:01 AM, Benjamin Smith wrote:
> > > Does anybody here have any recommendations for using PostgreSQL 9.4
> > > (latest) with ZFS?
> >
> > For databases, I've always used mirrored pools, not raidz*.
>
>
> > put pgdata in its own zfs file system in your zpool.  on that dedicated
> > zfs, set the blocksize to 8k.
>
> Based on my reading here, that would be -o ashift=13 ?
> HowDoesZFSonLinuxHandleAdvacedFormatDrives
>
> EG: 2^13 = 8192
>
>
>
>
No, that would be:

zfs create -o recordsize=8192 <pool>/path/to/pgdata

this is for the DATASET, not the POOL
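
To make the distinction concrete (pool name, layout and device names below are
only placeholders): ashift is fixed per vdev when the pool is created, while
recordsize is a per-dataset property you can set at any time.

    # pool creation: ashift=12 aligns writes to 4k sectors (2^12), ashift=13 to 8k
    zpool create -o ashift=12 tank mirror sda sdb

    # dataset for the postgres data directory, 8k records to match PG's page size
    zfs create -o recordsize=8k tank/pgdata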

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961


Re: Postgresql 9.4 and ZFS?

From: John R Pierce
On 9/29/2015 1:08 PM, Benjamin Smith wrote:
>> put pgdata in its own zfs file system in your zpool.  on that dedicated
>> zfs, set the blocksize to 8k.
>
> Based on my reading here, that would be -o ashift=13 ?
> HowDoesZFSonLinuxHandleAdvacedFormatDrives
>
> EG: 2^13 = 8192

sorry, I meant recordsize.

    zfs set recordsize=8192 zpool/pgdata



-- 
john r pierce, recycling bits in santa cruz

Re: Postgresql 9.4 and ZFS?

From: Patric Bechtel

Hi Benjamin,

if you're using compression, forget about that. You need to synchronize the ashift value to the
internal rowsize of your SSD, that's it. Make sure your SSD doesn't lie to you regarding writing
blocks and their respective order. In that case you might even choose to set sync=disabled. Also,
set atime=off and relatime=on. For faster snapshot transfers, you might like to set the checksum
algo to SHA256.
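
Spelled out, those settings look something like this (the dataset name is just
an example):

    zfs set atime=off tank/pgdata
    zfs set relatime=on tank/pgdata
    zfs set checksum=sha256 tank/pgdata
    # only if the SSD honours write ordering / has power-loss protection:
    zfs set sync=disabled tank/pgdata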

As always, put zfs.conf into /etc/modprobe.d with

options spl spl_kmem_cache_slab_limit=16384
options zfs zfs_arc_max=8589934592

you might want to adjust the zfs_arc_max value to your liking. Don't set it to more than 1/3 of
your RAM, just saying.
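
(The 8589934592 above is 8 GiB. To scale it, take roughly a third of your RAM
in bytes - e.g. on a 64 GB machine that would be about 21 GiB:)

    options zfs zfs_arc_max=22548578304   # 21 * 1024^3, ~1/3 of 64 GB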

I'm running the above configuration on >30 servers in production atm, plus about 10 in test/dev
environments; speed is awesome. No data loss so far, despite quite a few brown/blackouts already.

I'm using Ubuntu LTS (precise/trusty) with newest HWE, btw.

hope that helps,

Patric

Benjamin Smith schrieb am 29.09.2015 um 22:08:
> On Tuesday, September 29, 2015 10:35:28 AM John R Pierce wrote:
>> On 9/29/2015 10:01 AM, Benjamin Smith wrote:
>>> Does anybody here have any recommendations for using PostgreSQL 9.4 (latest) with ZFS?
>>
>> For databases, I've always used mirrored pools, not raidz*.
>
>
>> put pgdata in its own zfs file system in your zpool.  on that dedicated zfs, set the
>> blocksize to 8k.
>
> Based on my reading here, that would be -o ashift=13 ?
> HowDoesZFSonLinuxHandleAdvacedFormatDrives
>
> EG: 2^13 = 8192
>
>
>
>


Re: Postgresql 9.4 and ZFS?

From: Jov

I use pg with zfs on FreeBSD and it works great. I tested ZoL a year ago; it would crash the OS under load.

Some notes for pg on FreeBSD from my experience:
1. If you use compression, an 8k recordsize makes the compression ratio poor. I reached 7.x with gzip using the default record size (128k, if I remember correctly), while I only got 2.x with an 8k record size.

2. Use a SLOG to reduce fragmentation and improve sync write performance (see the example below).
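
For 2., adding a separate log device is a one-liner (pool and device names are
just examples):

    zpool add tank log nvd0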

Re: Postgresql 9.4 and ZFS?

From: Tomas Vondra
Hi,

On 09/30/2015 12:21 AM, Patric Bechtel wrote:
>
> Hi Benjamin,
>
> if you're using compression, forget about that. You need to
> synchronize the ashift value to the internal rowsize of you SSD,
> that's it. Make sure your SSD doesn't lie to you regarding writing
> blocks and their respective order. In that case you might even choose
> to set sync=disabled. Also, set atime=off and relatime=on. For faster
> snapshot transfers, you might like to set the checksum algo to
> SHA256.

What is "SSD rowsize". Do you mean size of the internal pages?

FWIW I've been doing extensive benchmarking of ZFS (on Linux), including
tests of different ashift values, and I see pretty much no difference
between ashift=12 and ashift=13 (4k vs 8k).

To show some numbers, these are pgbench results with 16 clients:

   type      scale    ashift=12   ashift=13  rsize=8k   logbias
   ----------------------------------------------------------------
   ro        small        53097       53159     53696     53221
   ro        medium       42869       43112     47039     46952
   ro        large         3127        3108     27736     28027
   rw        small         6593        6301      6384      6753
   rw        medium        1902        1890      4639      5034
   rw        large          561         554      2168      2585

small=150MB, medium=2GB, large=16GB (on a machine with 8GB of RAM)

The tests are "adding" the features, i.e. the columns are actually:

* ashift=12
* ashift=13
* ashift=13 + recordsize=8kB
* ashift=13 + recordsize=8kB + logbias=throughput
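
(For reference, the rsize/logbias columns correspond to dataset settings along
these lines - dataset name is only illustrative:)

    zfs set recordsize=8k tank/pgdata
    zfs set logbias=throughput tank/pgdata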

I've also done a few runs with compression, but that reduces the
performance a bit (understandably).

>
> As always, put zfs.conf into /etc/modprobe.d with
>
> options spl spl_kmem_cache_slab_limit=16384
> options zfs zfs_arc_max=8589934592
>
> you might want to adjust the zfs_arc_max value to your liking. Don't
> set it to more than 1/3 of your RAM, just saying.

Why? My understanding is that the ARC cache is roughly equivalent to the page
cache - implemented differently and not as tightly integrated with the kernel,
but it should release the memory when needed and so on. Perhaps not letting it
use all the RAM is a good idea, but 1/3 seems a bit too aggressive?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Postgresql 9.4 and ZFS?

From: Tomas Vondra
Hi,

On 09/29/2015 07:01 PM, Benjamin Smith wrote:
> Does anybody here have any recommendations for using PostgreSQL 9.4
> (latest) with ZFS?

I think this really depends on the workload - if you have a lot of
random writes, CoW filesystems will perform significantly worse than
e.g. EXT4 or XFS, even on SSD.

>
> We've been running both on ZFS/CentOS 6 with excellent results, and
> are considering putting the two together. In particular, the CoW
> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> is itself a form of intentional thrashing that doesn't affect
> performance since SSDs have no meaningful seek time.

I don't think that's entirely true. Sure, SSD drives handle random I/O
much better than rotational storage, but it's not entirely free and
sequential I/O is still measurably faster.

It's true that the drives do internal wear leveling, but it probably
uses tricks that are impossible to do at the filesystem level (which is
oblivious to internal details of the SSD). CoW also increases the amount
of blocks that need to be reclaimed.

In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x faster
than ZFS. But of course, if the ZFS features are interesting for you,
maybe it's a reasonable price.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Postgresql 9.4 and ZFS?

From: Patric Bechtel

Hi Tomas,

Tomas Vondra schrieb am 30.09.2015 um 14:01:
> Hi,
>
> On 09/30/2015 12:21 AM, Patric Bechtel wrote:
>> Hi Benjamin,
>>
>> if you're using compression, forget about that. You need to synchronize the ashift value to
>> the internal rowsize of you SSD, that's it. Make sure your SSD doesn't lie to you regarding
>> writing blocks and their respective order. In that case you might even choose to set
>> sync=disabled. Also, set atime=off and relatime=on. For faster snapshot transfers, you might
>> like to set the checksum algo to SHA256.
>
> What is "SSD rowsize". Do you mean size of the internal pages?

Yep. In my experience, it helps write performance a lot. At least over extended period of time
(less write amplification).

> FWIW I've been doing extensive benchmarking of ZFS (on Linux), including tests of different
> ashift values, and I see pretty much no difference between ashift=12 and ashift=13 (4k vs 8k).
>
> To show some numbers, these are pgbench results with 16 clients:
>
> type      scale    ashift=12   ashift=13  rsize=8k   logbias
> ----------------------------------------------------------------
> ro        small        53097       53159     53696     53221
> ro        medium       42869       43112     47039     46952
> ro        large         3127        3108     27736     28027
> rw        small         6593        6301      6384      6753
> rw        medium        1902        1890      4639      5034
> rw        large          561         554      2168      2585
>
> small=150MB, medium=2GB, large=16GB (on a machine with 8GB of RAM)
>
> The tests are "adding" the features, i.e. the columns are actually:
>
> * ashift=12
> * ashift=13
> * ashift=13 + recordsize=8kB
> * ashift=13 + recordsize=8kB + logbias=throughput
>
> I've also done a few runs with compression, but that reduces the performance a bit
> (understandably).

I'm somewhat surprised by the influence of the rsize value. I will recheck that. In my case, the
compression actually improved throughput quite a bit, but that might change depending on CPU speed
vs IO speed. Our CPUs are quite powerful, but the SSDs are just SATA Samsung/OCZ models at least
18 months old. Also, I measured the write performance over several hours, to push the internal gc
of the SSD to its limits. We had some problems in the past with (e.g. Intel) SSDs and their
behaviour (<1MB/s), so that's why I put some emphasis on that.

>>
>> As always, put zfs.conf into /etc/modprobe.d with
>>
>> options spl spl_kmem_cache_slab_limit=16384
>> options zfs zfs_arc_max=8589934592
>>
>> you might want to adjust the zfs_arc_max value to your liking. Don't set it to more than 1/3
>> ofyour RAM, just saying.
>
> Why? My understanding is that ARC cache is ~ page cache, although implemented differently and
> not as tightly integrated with the kernel, but it should release the memory when needed and
> such. Perhaps not letting it to use all the RAM is a good idea, but 1/3 seems a bit too
> aggressive?

First of all: the setting is somewhat 'disregarded' by zfs, as it's the net size of the buffer.
The gross side (with padding and alignment) isn't counted there, so in fact the cache fills up to
2/3 of the memory, which is plenty. Also, sometimes the ARC shrinking process isn't as fast
as necessary, so leaving some headroom just in case isn't a bad strategy, IMHO.

Patric


Re: Postgresql 9.4 and ZFS?

From: Benjamin Smith
On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> I think this really depends on the workload - if you have a lot of
> random writes, CoW filesystems will perform significantly worse than
> e.g. EXT4 or XFS, even on SSD.

I'd be curious about the information you have that leads you to this
conclusion. As with many (most?) "rules of thumb", the devil is quite often
in the details.

> > We've been running both on ZFS/CentOS 6 with excellent results, and
> > are considering putting the two together. In particular, the CoW
> > nature (and subsequent fragmentation/thrashing) of ZFS becomes
> > largely irrelevant on SSDs; the very act of wear leveling on an SSD
> > is itself a form of intentional thrashing that doesn't affect
> > performance since SSDs have no meaningful seek time.
>
> I don't think that's entirely true. Sure, SSD drives handle random I/O
> much better than rotational storage, but it's not entirely free and
> sequential I/O is still measurably faster.
>
> It's true that the drives do internal wear leveling, but it probably
> uses tricks that are impossible to do at the filesystem level (which is
> oblivious to internal details of the SSD). CoW also increases the amount
> of blocks that need to be reclaimed.
>
> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x faster
> than ZFS. But of course, if the ZFS features are interesting for you,
> maybe it's a reasonable price.

Again, the details would be highly interesting to me. What memory optimization
was done? Status of snapshots? Was the pool RAIDZ or mirrored vdevs? How many
vdevs? Was compression enabled? What ZFS release was this? Was this on Linux,
Free/Open/Net BSD, Solaris, or something else?

A 2x performance difference is almost inconsequential in my experience, where
growth is exponential. 2x performance change generally means 1 to 2 years of
advancement or deferment against the progression of hardware; our current,
relatively beefy DB servers are already older than that, and have an
anticipated life cycle of at least another couple years.

// Our situation //
Lots of RAM for the workload: 128 GB of ECC RAM with an on-disk DB size of ~
150 GB. Pretty much, everything runs straight out of RAM cache, with only
writes hitting disk. Smart reports 4/96 read/write ratio.

Query load: Constant, heavy writes and heavy use of temp tables in order to
assemble very complex queries. Pretty much the "worst case" mix of reads and
writes, average daily peak of about 200-250 queries/second.

16 Core XEON servers, 32 HT "cores".

SAS 3 Gbps

CentOS 6 is our O/S of choice.

Currently, we're running Intel 710 SSDs in a software RAID1 without trim
enabled and are generally happy with the reliability and performance we see. We're
planning to upgrade storage soon (since we're over 50% utilization) and, in the
process, bring in the magic goodness of ZFS snapshots/clones.


Re: Postgresql 9.4 and ZFS?

From: Keith Fiske


On Tue, Sep 29, 2015 at 1:01 PM, Benjamin Smith <ben@chico.com> wrote:
Does anybody here have any recommendations for using PostgreSQL 9.4 (latest)
with ZFS?

We've been running both on ZFS/CentOS 6 with excellent results, and are
considering putting the two together. In particular, the CoW nature (and
subsequent fragmentation/thrashing) of ZFS becomes largely irrelevant on SSDs;
the very act of wear leveling on an SSD is itself a form of intentional
thrashing that doesn't affect performance since SSDs have no meaningful seek
time. It would seem that PGCon  2013 even had a workshop on it!
https://www.pgcon.org/2013/schedule/events/612.en.html

The exact configuration we're contemplating is either (3x 400 RAIDZ1) or (4x
400 RAIDZ2) with Intel Enterprise SATA3 SSDs, with default (lz4) compression
enabled.

If this is a particularly good or bad idea, I'd like to hear it, and why?

Thanks,

BenP




We've run postgres on ZFS for years with great success (first on OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The snapshotting feature makes upgrades on large clusters much less scary (snapshot and revert if it goes bad) and being able to bring a snapshot backup up as a clone to restore an accidentally dropped table is great. 

Others have given a lot of great advice as far as system tuning. The only other thing I can add is that you definitely do want your data directory on its own pool. But I recommend putting the actual data in a folder under that pool (possibly named by major version). For example, if your pool is

/data/postgres

Create a folder under that directory to actually put the data:

mkdir /data/postgres/9.4

This allows pg_upgrade's --link option to work during major upgrades since you can't have an upgrade destination on a different filesystem. Just make a 9.5 directory in the same spot when the time comes around. With ZFS snapshots available, there's really no reason not to use the --link option to greatly speed up upgrades.
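
A rough sketch of that layout and an eventual upgrade (pool name, mountpoint
and binary paths are just examples, not a recipe):

    zfs create -o mountpoint=/data/postgres -o recordsize=8k tank/postgres
    mkdir /data/postgres/9.4 /data/postgres/9.5
    # snapshot first so you can roll back if the upgrade goes sideways
    zfs snapshot tank/postgres@pre-9.5-upgrade
    # initdb the 9.5 directory, stop both clusters, then:
    pg_upgrade --link \
        -d /data/postgres/9.4 -D /data/postgres/9.5 \
        -b /usr/pgsql-9.4/bin -B /usr/pgsql-9.5/bin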

--
Keith Fiske
Database Administrator
OmniTI Computer Consulting, Inc.
http://www.keithf4.com

Re: Postgresql 9.4 and ZFS?

From: Tomas Vondra

On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
>> I think this really depends on the workload - if you have a lot of
>> random writes, CoW filesystems will perform significantly worse than
>> e.g. EXT4 or XFS, even on SSD.
>
> I'd be curious about the information you have that leads you to this
> conclusion. As with many (most?) "rules of thumb", the devil is
> quiteoften the details.

A lot of testing done recently, and also experience with other CoW
filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
random writes).

>>> We've been running both on ZFS/CentOS 6 with excellent results, and
>>> are considering putting the two together. In particular, the CoW
>>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
>>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
>>> is itself a form of intentional thrashing that doesn't affect
>>> performance since SSDs have no meaningful seek time.
>>
>> I don't think that's entirely true. Sure, SSD drives handle random I/O
>> much better than rotational storage, but it's not entirely free and
>> sequential I/O is still measurably faster.
>>
>> It's true that the drives do internal wear leveling, but it probably
>> uses tricks that are impossible to do at the filesystem level (which is
>> oblivious to internal details of the SSD). CoW also increases the amount
>> of blocks that need to be reclaimed.
>>
>> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
>> faster than ZFS. But of course, if the ZFS features are interesting
>> for you, maybe it's a reasonable price.
>
> Again, the details would be highly interesting to me. What memory
> optimization was done? Status of snapshots? Was the pool RAIDZ or
> mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
> release was this? Was this on Linux,Free/Open/Net BSD, Solaris, or
> something else?

I'm not sure what you mean by "memory optimization" so the answer is
probably "no".

FWIW I don't have much experience with ZFS in production, all I have is
data from benchmarks I've recently done, exactly with the goal of educating
myself on the differences between current filesystems.

The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
recent versions, IMHO.

My goal was to test the file systems under the same conditions, so I used
a single device (Intel S3700 SSD). I'm aware that this is not a perfect
test and ZFS offers interesting options (e.g. moving ZIL to a separate
device). I plan to benchmark some additional configurations with more
devices and such.

>
> A 2x performance difference is almost inconsequential in my
> experience, where growth is exponential. 2x performance change
> generally means 1 to 2 years of advancement or deferment against the
> progression of hardware; our current, relatively beefy DB servers
> are already older than that, and have an anticipated life cycle of at
> leastanother couple years.

I'm not sure I understand what you suggest here. What I'm saying is that
when I do a stress test on the same hardware, I do get ~2x the
throughput with EXT4/XFS, compared to ZFS.

> // Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
> with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
> straight out of RAM cache, with only writes hitting disk. Smart
> reports 4/96 read/write ratio.

So your active set fits into RAM? I'd guess all your writes are then WAL
+ checkpoints, which probably makes them rather sequential.

If that's the case, CoW filesystems may perform quite well - I was
mostly referring to workloads with a lot of random writes to the device.

> Query load: Constant, heavy writes and heavy use of temp tables in
> order to assemble very complex queries. Pretty much the "worst case"
> mix of reads and writes, average daily peak of about 200-250
> queries/second.

I'm not sure how much random I/O that actually translates to. According
to the numbers I've posted to this thread a few hours ago, a tuned ZFS on
a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
those are OLTP queries - your queries may write much more data. OTOH it
really does not matter that much if your active set fits into RAM,
because then it's mostly about writing to ZIL.

>
> 16 Core XEON servers, 32 HT "cores".
>
> SAS 3 Gbps
>
> CentOS 6 is our O/S of choice.
>
> Currently, we're running Intel 710 SSDs in a software RAID1 without
> trim enabled and generally happy with the reliability and performance
> we see. We're planning to upgrade storage soon (since we're over 50%
> utilization) and in the process, bring the magic goodness of
> snapshots/clones from ZFS.

I presume by "software RAID1" you mean "mirrored vdev zpool", correct?


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Postgresql 9.4 and ZFS?

From: Tomas Vondra

On 09/30/2015 03:45 PM, Patric Bechtel wrote:
>
> Hi Tomas,
>
> Tomas Vondra schrieb am 30.09.2015 um 14:01:
>> Hi,
>>
...
>>
>> I've also done a few runs with compression, but that reduces the performance a bit
>> (understandably).
>
> I'm somewhat surprised by the influence of the rsize value. I will recheck that. In my case, the
> compression actually improved throughput quite a bit, but that might change depending on CPU speed
> vs IO speed. Our CPU's are quite powerful, but the SSD are just SATA Samsung/OCZ models at least
> 18 months old. Also, I measured the write performance over several hours, to push the internal gc
> of the SSD to its limits. We had some problems in the past with (e.g. Intel) SSD's and their
> behaviour (<1MB/s), so that's why I put some emphasis on that.

I think it really depends on how random the workload is. If the workload is
random (as for example the workload simulated by pgbench), the
recordsize seems to matter a lot - perhaps your workload is not as
random? Same for compression.

I'm no ZFS expert, but I assume ARC tracks those record chunks, so
using 128kB records (default) means you have 16x fewer slots than with
8kB records. You may still cache the same amount of data, but the cache
may not adapt that well to your active set.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Postgresql 9.4 and ZFS?

From: Benjamin Smith
On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
> On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> > On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> >> I think this really depends on the workload - if you have a lot of
> >> random writes, CoW filesystems will perform significantly worse than
> >> e.g. EXT4 or XFS, even on SSD.
> >
> > I'd be curious about the information you have that leads you to this
> > conclusion. As with many (most?) "rules of thumb", the devil is
> > quiteoften the details.
>
> A lot of testing done recently, and also experience with other CoW
> filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
> random writes).
>
> >>> We've been running both on ZFS/CentOS 6 with excellent results, and
> >>> are considering putting the two together. In particular, the CoW
> >>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> >>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> >>> is itself a form of intentional thrashing that doesn't affect
> >>> performance since SSDs have no meaningful seek time.
> >>
> >> I don't think that's entirely true. Sure, SSD drives handle random I/O
> >> much better than rotational storage, but it's not entirely free and
> >> sequential I/O is still measurably faster.
> >>
> >> It's true that the drives do internal wear leveling, but it probably
> >> uses tricks that are impossible to do at the filesystem level (which is
> >> oblivious to internal details of the SSD). CoW also increases the amount
> >> of blocks that need to be reclaimed.
> >>
> >> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
> >> faster than ZFS. But of course, if the ZFS features are interesting
> >> for you, maybe it's a reasonable price.
> >
> > Again, the details would be highly interesting to me. What memory
> > optimization was done? Status of snapshots? Was the pool RAIDZ or
> > mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
> > release was this? Was this on Linux,Free/Open/Net BSD, Solaris, or
> > something else?
>
> I'm not sure what you mean by "memory optimization" so the answer is
> probably "no".

I mean the full gamut:

Did you use an l2arc? Did you use a dedicated ZIL? What was arc_max set to?
How much RAM (in GB) was installed on the machine? How did you set up PG? (PG
defaults are historically horrible for higher-RAM machines)
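
(Purely as an illustration of the kind of knobs I mean - actual values depend
entirely on the machine and workload:)

    # postgresql.conf on a high-RAM box
    shared_buffers = 16GB
    effective_cache_size = 96GB
    work_mem = 64MB
    maintenance_work_mem = 2GB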

> FWIW I don't have much experience with ZFS in production, all I have is
> data from benchmarks I've recently done exactly with the goal to educate
> myself on the differences of current filesystems.
>
> The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
> recent versions, IMHO.
>
> My goal was to test the file systems under the same conditions and used
> a single device (Intel S3700 SSD). I'm aware that this is not a perfect
> test and ZFS offers interesting options (e.g. moving ZIL to a separate
> device). I plan to benchmark some additional configurations with more
> devices and such.

Also, did you try with/without compression? My information so far is that
compression significantly improves overall performance.

> > A 2x performance difference is almost inconsequential in my
> > experience, where growth is exponential. 2x performance change
> > generally means 1 to 2 years of advancement or deferment against the
> > progression of hardware; our current, relatively beefy DB servers
> > are already older than that, and have an anticipated life cycle of at
> > leastanother couple years.
>
> I'm not sure I understand what you suggest here. What I'm saying is that
> when I do a stress test on the same hardware, I do get ~2x the
> throughput with EXT4/XFS, compared to ZFS.

What I'm saying is only what it says on its face: A 50% performance difference
is rarely enough to make or break a production system; performance/capacity
reserves of 95% or more are fairly typical, which means the difference between
5% utilization and 10%. Even if latency rose by 50%, that's typically the
difference between 20ms and 30ms, not enough that anybody'd notice over the
'net for a SOAP/REST call, even if it's enough to make you want to
optimize things a bit.

> > // Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
> > with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
> > straight out of RAM cache, with only writes hitting disk. Smart
> > reports 4/96 read/write ratio.
>
> So your active set fits into RAM? I'd guess all your writes are then WAL
> + checkpoints, which probably makes them rather sequential.
>
> If that's the case, CoW filesystems may perform quite well - I was
> mostly referring to workloads with a lot of random writes to he device.

That's *MY* hope, anyway! :)

> > Query load: Constant, heavy writes and heavy use of temp tables in
> > order to assemble very complex queries. Pretty much the "worst case"
> > mix of reads and writes, average daily peak of about 200-250
> > queries/second.
>
> I'm not sure how much random I/O that actually translates to. According
> to the numbers I've posted to this thread few hours ago, a tuned ZFS on
> a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
> those are OLTP queries - your queries may write much more data. OTOH it
> really does not matter that much if your active set fits into RAM,
> because then it's mostly about writing to ZIL.

I personally don't yet know how much sense an SSD-backed ZIL makes when the
storage media is also SSD-based.

> > 16 Core XEON servers, 32 HT "cores".
> >
> > SAS 3 Gbps
> >
> > CentOS 6 is our O/S of choice.
> >
> > Currently, we're running Intel 710 SSDs in a software RAID1 without
> > trim enabled and generally happy with the reliability and performance
> > we see. We're planning to upgrade storage soon (since we're over 50%
> > utilization) and in the process, bring the magic goodness of
> > snapshots/clones from ZFS.
>
> I presume by "software RAID1" you mean "mirrored vdev zpool", correct?

I mean "software RAID 1" with Linux/mdadm. We haven't put ZFS into production
use on any of our DB servers, yet.

Thanks for your input.

Ben


Re: Postgresql 9.4 and ZFS?

From: Benjamin Smith
On Wednesday, September 30, 2015 03:49:44 PM Keith Fiske wrote:
> We've run postgres on ZFS for years with great success (first on
> OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The
> snapshotting feature makes upgrades on large clusters much less scary
> (snapshot and revert if it goes bad) and being able to bring a snapshot
> backup up as a clone to restore an accidentally dropped table is great.

Somebody mentioned some trouble running it with ZFS on Linux, which is exactly
how we're planning our rollout (we're a RHEL/CentOS shop). Have you tried
that config, and has it worked for you?

> Others have given a lot of great advice as far as system tuning. Only other
> thing I can add is you definitely do want your data directory on its own
> pool. But I recommend putting the actual data in a folder under that pool
> (possibly by major version name). For example if your pool is
>
> /data/postgres
>
> Create a folder under that directory to actually put the data:
>
> mkdir /data/postgres/9.4
>
> This allows pg_upgrade's --link option to work during major upgrades since
> you can't have an upgrade destination on a different filesystem. Just make
> a 9.5 directory in the same spot when the time comes around. With ZFS
> snapshots available, there's really no reason not to use the --link option
> to greatly speed up upgrades.

Recently, the PGDG RPMs provided by PostgreSQL have done something similar by
moving from /var/lib/pgsql/ to (e.g.) /var/lib/pgsql/9.4 and we've followed
suit, trying to keep things "stock" where possible.

Our intent is to make /var/lib/pgsql a filesystem in a pool containing no other
file systems, with SSD-based VDEVs that aren't shared for any other purpose.
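
(Something along these lines, with the pool name and devices still to be
decided:)

    zpool create pgpool raidz1 <ssd1> <ssd2> <ssd3>
    zfs create -o mountpoint=/var/lib/pgsql -o recordsize=8k pgpool/pgsql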


Re: Postgresql 9.4 and ZFS?

From: Keith Fiske




On Wed, Sep 30, 2015 at 4:58 PM, Benjamin Smith <ben@chico.com> wrote:
On Wednesday, September 30, 2015 03:49:44 PM Keith Fiske wrote:
> We've run postgres on ZFS for years with great success (first on
> OpenSolaris, now on OmniOS, and I personally run it on FreeBSD). The
> snapshotting feature makes upgrades on large clusters much less scary
> (snapshot and revert if it goes bad) and being able to bring a snapshot
> backup up as a clone to restore an accidentally dropped table is great.

Somebody mentioned some trouble running it with ZFS on Linux, which is exactly
how we're planning our roll out. (We're a RHEL/CentOS shop) Have you tried
that config, and has it worked for you?

We've not run it in production where I work and I haven't met anyone who is doing it either. Personally, I tried it at home for a while when I used to use Linux on my home server. But whenever there was a kernel or zfs update, I'd occasionally have problems with it booting up or seeing the zfs mount. Rebooting again usually fixed it, but it made me nervous every time there was a kernel update. I switched to FreeBSD a few years ago to get native ZFS support and haven't looked back since. As that was a few years ago, things may have improved, but I can't speak to those improvements anymore.



Re: Postgresql 9.4 and ZFS?

From: Joseph Kregloh


On Wed, Sep 30, 2015 at 5:12 PM, Benjamin Smith <lists@benjamindsmith.com> wrote:
On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
> On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> > On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> >> I think this really depends on the workload - if you have a lot of
> >> random writes, CoW filesystems will perform significantly worse than
> >> e.g. EXT4 or XFS, even on SSD.
> >
> > I'd be curious about the information you have that leads you to this
> > conclusion. As with many (most?) "rules of thumb", the devil is
> > quiteoften the details.
>
> A lot of testing done recently, and also experience with other CoW
> filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
> random writes).
>
> >>> We've been running both on ZFS/CentOS 6 with excellent results, and
> >>> are considering putting the two together. In particular, the CoW
> >>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> >>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> >>> is itself a form of intentional thrashing that doesn't affect
> >>> performance since SSDs have no meaningful seek time.
> >>
> >> I don't think that's entirely true. Sure, SSD drives handle random I/O
> >> much better than rotational storage, but it's not entirely free and
> >> sequential I/O is still measurably faster.
> >>
> >> It's true that the drives do internal wear leveling, but it probably
> >> uses tricks that are impossible to do at the filesystem level (which is
> >> oblivious to internal details of the SSD). CoW also increases the amount
> >> of blocks that need to be reclaimed.
> >>
> >> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
> >> faster than ZFS. But of course, if the ZFS features are interesting
> >> for you, maybe it's a reasonable price.
> >
> > Again, the details would be highly interesting to me. What memory
> > optimization was done? Status of snapshots? Was the pool RAIDZ or
> > mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
> > release was this? Was this on Linux,Free/Open/Net BSD, Solaris, or
> > something else?
>
> I'm not sure what you mean by "memory optimization" so the answer is
> probably "no".

I mean the full gamut:

Did you use an l2arc? Did you use a dedicated ZIL? What was arc_max set to?
How much RAM/GB was installed on the machine? How did you set up PG? (PG
defaults are historically horrible for higher-RAM machines)


In my testing with pgbench I actually saw a decrease in performance with a ZIL enabled. I ended up just keeping the L2ARC and dropping the ZIL. A ZIL will not provide you with any speed boost for a database. On a NAS with NFS shares, for example, a ZIL would work well. The ZIL is more for data protection than anything.

In production I run FreeBSD 10.1 with an NVMe mirror for L2ARC; the rest of the storage is spinning drives, with a combination of filesystem compression settings. For example, archival tablespaces and the log folder are on gzip compression on an external array. Faster stuff like the xlog is on lz4 and on an internal array.
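
(The dataset names here are made up, but the property settings look like:)

    zfs set compression=gzip tank/pg_archive    # archival tablespaces + log folder
    zfs set compression=lz4 zroot/pg_xlog       # faster stuff like the xlog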

If you are interested I might still have the data from when I executed the tests.

-Joseph Kregloh


Re: Postgresql 9.4 and ZFS?

From: Jim Nasby
On 10/1/15 8:50 AM, Joseph Kregloh wrote:
> In my testing with pgbench I actually saw a decrease in performance with
> a ZIL enabled. I ended up just keeping the L2ARC and dropping the ZIL. A ZIL
> will not provide you with any speed boost as a database. On a NAS with
> NFS shared for example, a ZIL would work well. ZIL is more for data
> protection than anything.
>
> I run in Production FreeBSD 10.1 with an NVMe mirror for L2ARC, the rest
> of the storage is spinning drives. With a combination of filesystem
> compressions. For example, archival tablespaces and the log folder are
> on gzip compression on an external array. Faster stuff like the xlog are
> lz4 and on an internal array.

I'm not a ZFS expert, but my understanding is that a ZIL *that has lower
latency than main storage* can be a performance win. This is similar to
the idea of giving pg_xlog its own dedicated volume so that it's not
competing with all the other IO traffic every time you do a COMMIT.
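
(For example - paths are hypothetical - the WAL can be pointed at its own
low-latency volume when the cluster is created:)

    initdb -D /data/postgres/9.4 -X /fast/pg_xlog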

Recent versions of Postgres go to a lot of trouble to make fsync as
painless as possible, so a ZIL might not help much in many cases. Where
it could still help is if you're running synchronous_commit = true and
you consistently get lower latency on the ZIL than on the vdevs; that
will make every COMMIT run faster.

(BTW, this is all based on the assumption that ZFS treats fsync as a
synchronous request.)
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: Postgresql 9.4 and ZFS?

From: Joseph Kregloh


On Thu, Oct 1, 2015 at 5:51 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 10/1/15 8:50 AM, Joseph Kregloh wrote:
In my testing with pgbench I actually saw a decrease in performance with
a ZIL enabled. I ended up just keeping the L2ARC and dropping the. ZIL
will not provide you with any speed boost as a database. On a NAS with
NFS shared for example, a ZIL would work well. ZIL is more for data
protection than anything.

I run in Production FreeBSD 10.1 with an NVMe mirror for L2ARC, the rest
of the storage is spinning drives. With a combination of filesystem
compressions. For example, archival tablespaces and the log folder are
on gzip compression on an external array. Faster stuff like the xlog are
lz4 and on an internal array.

I'm not a ZFS expert, but my understanding is that a ZIL *that has lower latency than main storage* can be a performance win. This is similar to the idea of giving pg_xlog it's own dedicated volume so that it's not competing with all the other IO traffic every time you do a COMMIT.

Recent versions of Postgres go to a lot of trouble to make fsync as painless as possible, so a ZIL might not help much in many cases. Where it could still help is if you're running synchronous_commit = true and you consistently get lower latency on the ZIL than on the vdev's; that will make every COMMIT run faster.

(BTW, this is all based on the assumption that ZFS treats fsync as a synchronous request.)

The ZIL, or ZFS Intent Log, is, as the name describes, just a log. It replays transactions that may have been lost in the event of a machine failure. If the machine crashes, then upon startup ZFS will replay the data stored on the ZIL device and try to fix any errors. During normal runtime the ZIL is never read from, only written to.

When there is no separate ZIL device, a synchronous write makes ZFS store the data in RAM and in the ZIL residing on the vdev. Once it acknowledges that the data is all there, it will flush it from RAM to its final write location on the vdev.

When you have a fast ZIL device like an SSD or NVMe drive, it will do the same: store the data in RAM and on the fast ZIL device. Once acknowledged, it will also write from RAM to the vdev. In theory this does give you a faster acknowledgement time.

In either case you are still "bottlenecked" by the speed of the write from RAM to the zpool. Now, for a small database without many writes a ZIL would be awesome. But on a write-heavy database you will be acknowledging more writes because of the ZIL than you are physically able to write from RAM to the zpool, thereby degrading performance.

At least this is how it works in my head.

-Joseph Kregloh


Re: Postgresql 9.4 and ZFS?

From: Jim Nasby
On 10/1/15 8:04 PM, Joseph Kregloh wrote:
> In either case you are still "bottlenecked" by the speed of the write
> from RAM to the zpool. Now for a small database with not many writes a
> ZIL would be awesome. But on a write heavy database you will be
> acknowledging more writes because of the ZIL that what you are
> physically able to write from RAM to zpool, thereby degrading performance.

Unless ZFS lies about fsync, you'll have to wait for writes to go
somewhere, either the main pool or the ZIL. Because the ZIL is
effectively write-only, having a ZIL could significantly reduce fsync
latency because it's a) only writing and b) writing sequentially.

Essentially, it's the same idea as having a separate pg_xlog partition
(except that depending on your setup there may well be pg_xlog read
activity as well).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com