Re: Postgresql 9.4 and ZFS? - Mailing list pgsql-general

From Benjamin Smith
Subject Re: Postgresql 9.4 and ZFS?
Msg-id 3154674.83D5nIIxW5@tesla.schoolpathways.com
In response to Re: Postgresql 9.4 and ZFS?  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Postgresql 9.4 and ZFS?  (Joseph Kregloh <jkregloh@sproutloud.com>)
List pgsql-general
On Wednesday, September 30, 2015 09:58:08 PM Tomas Vondra wrote:
> On 09/30/2015 07:33 PM, Benjamin Smith wrote:
> > On Wednesday, September 30, 2015 02:22:31 PM Tomas Vondra wrote:
> >> I think this really depends on the workload - if you have a lot of
> >> random writes, CoW filesystems will perform significantly worse than
> >> e.g. EXT4 or XFS, even on SSD.
> >
> > I'd be curious about the information you have that leads you to this
> > conclusion. As with many (most?) "rules of thumb", the devil is
> > quite often in the details.
>
> A lot of testing done recently, and also experience with other CoW
> filesystems (e.g. BTRFS explicitly warns about workloads with a lot of
> random writes).
>
> >>> We've been running both on ZFS/CentOS 6 with excellent results, and
> >>> are considering putting the two together. In particular, the CoW
> >>> nature (and subsequent fragmentation/thrashing) of ZFS becomes
> >>> largely irrelevant on SSDs; the very act of wear leveling on an SSD
> >>> is itself a form of intentional thrashing that doesn't affect
> >>> performance since SSDs have no meaningful seek time.
> >>
> >> I don't think that's entirely true. Sure, SSD drives handle random I/O
> >> much better than rotational storage, but it's not entirely free and
> >> sequential I/O is still measurably faster.
> >>
> >> It's true that the drives do internal wear leveling, but it probably
> >> uses tricks that are impossible to do at the filesystem level (which is
> >> oblivious to internal details of the SSD). CoW also increases the amount
> >> of blocks that need to be reclaimed.
> >>
> >> In the benchmarks I've recently done on SSD, EXT4 / XFS are ~2x
> >> faster than ZFS. But of course, if the ZFS features are interesting
> >> for you, maybe it's a reasonable price.
> >
> > Again, the details would be highly interesting to me. What memory
> > optimization was done? Status of snapshots? Was the pool RAIDZ or
> > mirrored vdevs? How many vdevs? Was compression enabled? What ZFS
> > release was this? Was this on Linux, Free/Open/Net BSD, Solaris, or
> > something else?
>
> I'm not sure what you mean by "memory optimization" so the answer is
> probably "no".

I mean the full gamut:

Did you use an l2arc? Did you use a dedicated ZIL? What was arc_max set to?
How much RAM (in GB) was installed on the machine? How did you set up PG? (PG
defaults are historically horrible for higher-RAM machines)
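
The questions in this thread boil down to a short checklist for anyone
reporting a ZFS vs. EXT4/XFS benchmark. Below is a minimal Python sketch of
that checklist. The names `ZfsBenchSetup` and `missing_details` are made up
for illustration and are not part of any ZFS or PostgreSQL tooling; the filled-in
values are only the ones stated elsewhere in this message.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ZfsBenchSetup:
    """Details asked about in this thread; None means 'not reported'."""
    os_and_kernel: Optional[str] = None      # e.g. "Linux 4.0.4"
    zfs_release: Optional[str] = None        # e.g. "0.6.4"
    pool_layout: Optional[str] = None        # RAIDZ, mirrored vdevs, single device
    vdev_count: Optional[int] = None
    compression: Optional[bool] = None
    snapshots_present: Optional[bool] = None
    l2arc_device: Optional[bool] = None
    dedicated_zil: Optional[bool] = None
    arc_max_gb: Optional[float] = None
    ram_gb: Optional[float] = None
    pg_tuned: Optional[bool] = None          # False would mean stock PG defaults


def missing_details(setup: ZfsBenchSetup) -> List[str]:
    """Return the names of the fields that were left unreported."""
    return [f.name for f in fields(setup) if getattr(setup, f.name) is None]


# What this message states about the quoted benchmark.
reported = ZfsBenchSetup(
    os_and_kernel="Linux 4.0.4",
    zfs_release="0.6.4",
    pool_layout="single device (Intel S3700 SSD)",
    vdev_count=1,
)
print(missing_details(reported))
```

Everything the function prints is something that could plausibly move the
result, which is why a bare "~2x" figure is hard to interpret on its own.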

> FWIW I don't have much experience with ZFS in production, all I have is
> data from benchmarks I've recently done exactly with the goal to educate
> myself on the differences of current filesystems.
>
> The tests were done on Linux, with kernel 4.0.4 / zfs 0.6.4. So fairly
> recent versions, IMHO.
>
> My goal was to test the file systems under the same conditions and used
> a single device (Intel S3700 SSD). I'm aware that this is not a perfect
> test and ZFS offers interesting options (e.g. moving ZIL to a separate
> device). I plan to benchmark some additional configurations with more
> devices and such.

Also, did you try with/without compression? My information so far is that
compression significantly improves overall performance.

> > A 2x performance difference is almost inconsequential in my
> > experience, where growth is exponential. 2x performance change
> > generally means 1 to 2 years of advancement or deferment against the
> > progression of hardware; our current, relatively beefy DB servers
> > are already older than that, and have an anticipated life cycle of at
> > least another couple years.
>
> I'm not sure I understand what you suggest here. What I'm saying is that
> when I do a stress test on the same hardware, I do get ~2x the
> throughput with EXT4/XFS, compared to ZFS.

What I'm saying is only what it says on its face: a 50% drop in throughput is
rarely enough to make or break a production system. Performance/capacity
reserves of 95% or more are fairly typical, so it means the difference between
5% utilization and 10%. Even if latency rose by 50%, that's typically the
difference between 20ms and 30ms. Over the 'net, for a SOAP/REST call, that's
not enough for anybody to notice, even if it's enough to make you want to
optimize things a bit.
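
To put numbers on that, here is a rough Python sketch of the arithmetic. The
function names are hypothetical, and the model is deliberately simple: it
assumes utilization scales linearly with the slowdown factor and that hardware
capacity doubles every fixed number of years (1 to 2, per the estimate quoted
above).

```python
import math


def utilization_after_slowdown(current_utilization: float, slowdown: float) -> float:
    """Utilization if the same load runs on storage `slowdown` times slower.

    Assumes utilization scales linearly with the slowdown factor.
    """
    return current_utilization * slowdown


def years_of_hardware_progress(slowdown: float, doubling_years: float) -> float:
    """Years of hardware advancement needed to absorb a slowdown factor.

    Assumes capacity doubles every `doubling_years` years.
    """
    return math.log2(slowdown) * doubling_years


# A 95% reserve means 5% utilization; halving throughput doubles it.
print(utilization_after_slowdown(0.05, 2.0))   # 0.1

# Latency up by 50%: 20 ms becomes 30 ms.
print(20 * 1.5)                                # 30.0

# A 2x difference costs one doubling period: 1 to 2 years.
print(years_of_hardware_progress(2.0, 1.0))    # 1.0
print(years_of_hardware_progress(2.0, 2.0))    # 2.0
```

The linear model stops being useful as utilization approaches 100%, where
queueing makes latency climb much faster than throughput falls.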

> > // Our situation // Lots of RAM for the workload: 128 GB of ECC RAM
> > with an on-disk DB size of ~ 150 GB. Pretty much, everything runs
> > straight out of RAM cache, with only writes hitting disk. Smart
> > reports 4/96 read/write ratio.
>
> So your active set fits into RAM? I'd guess all your writes are then WAL
> + checkpoints, which probably makes them rather sequential.
>
> If that's the case, CoW filesystems may perform quite well - I was
> mostly referring to workloads with a lot of random writes to the device.

That's *MY* hope, anyway! :)

> > Query load: Constant, heavy writes and heavy use of temp tables in
> > order to assemble very complex queries. Pretty much the "worst case"
> > mix of reads and writes, average daily peak of about 200-250
> > queries/second.
>
> I'm not sure how much random I/O that actually translates to. According
> to the numbers I've posted to this thread a few hours ago, a tuned ZFS on
> a single SSD device handles ~2.5k tps (with dataset ~2x the RAM). But
> those are OLTP queries - your queries may write much more data. OTOH it
> really does not matter that much if your active set fits into RAM,
> because then it's mostly about writing to ZIL.

I personally don't yet know how much sense an SSD-backed ZIL makes when the
storage media is also SSD-based.

> > 16 Core XEON servers, 32 HT "cores".
> >
> > SAS 3 Gbps
> >
> > CentOS 6 is our O/S of choice.
> >
> > Currently, we're running Intel 710 SSDs in a software RAID1 without
> > trim enabled and generally happy with the reliability and performance
> > we see. We're planning to upgrade storage soon (since we're over 50%
> > utilization) and in the process, bring the magic goodness of
> > snapshots/clones from ZFS.
>
> I presume by "software RAID1" you mean "mirrored vdev zpool", correct?

I mean "software RAID 1" with Linux/mdadm. We haven't put ZFS into production
use on any of our DB servers, yet.

Thanks for your input.

Ben

