Re: ZFS prefetch considered evil? - Mailing list pgsql-general

From Yaroslav Tykhiy
Subject Re: ZFS prefetch considered evil?
Date
Msg-id AA91F652-D0B0-48A9-9191-669D3ABE6778@barnet.com.au
In response to Re: ZFS prefetch considered evil?  (Alban Hertroys <dalroi@solfertje.student.utwente.nl>)
List pgsql-general
On 08/07/2009, at 8:39 PM, Alban Hertroys wrote:

> On Jul 8, 2009, at 2:50 AM, Yaroslav Tykhiy wrote:
>
>> Hi All,
>>
>> I have a mid-size database (~300G) used as an email store and
>> running on a FreeBSD + ZFS combo.  Its PG_DATA is on ZFS whilst
>> xlog goes to a different FFS disk.  ZFS prefetch was enabled by
>> default and disk time on PG_DATA was near 100% all the time with
>> transfer rates heavily biased to read: ~50-100M/s read vs ~2-5M/s
>> write.  A former researcher, I was going to set up disk performance
>> monitoring to collect some history and see if disabling prefetch
>> would have any effect, but today I had to find out the difference
>> the hard way.  Sorry, but that's why the numbers I can provide are
>> quite approximate.
>>
>> Due to a peak in user activity the server just melted down, with
>> mail data queries taking minutes to execute.  As the last resort, I
>> rebooted the server with ZFS prefetch disabled -- it couldn't be
>> disabled at run time in FreeBSD.  Now IMAP feels much more
>> responsive; transfer rates on PG_DATA are mostly <10M/s read and
>> 1-2M/s write; and disk time stays way below 100% unless a bunch of
>> email is being inserted.
>>
>> My conclusion is that although ZFS prefetch is supposed to be
>> adaptive and handle random access more or less OK, in reality there
>> is plenty of room for improvement, so to speak, and for now
>> PostgreSQL performance can benefit from keeping it disabled.
>> The same may apply to other database systems as well.
>
>
> Are you sure you weren't hitting swap?

A sceptic myself, I genuinely understand your doubt.  But this time I
was sure because I paid attention to the name of the busy device: it
was the PG_DATA disk, not swap.  Moreover, a thrashing system wouldn't
have had such a disparity between disk read and write rates, since
paging writes pages out as well as reading them back in.

> IIRC prefetch tries to keep data (disk blocks?) in memory that it
> fetched recently.

What you described is just a disk cache.  A trivial implementation of
prefetch works as follows:  An application or other file/disk
consumer asks the provider (driver, kernel, whatever) to read, say, 2
disk blocks worth of data.  The provider thinks, "I know you are
short-sighted; I bet you are going to ask for more contiguous blocks
very soon," so it schedules a disk read for many more contiguous
blocks than requested and caches them in RAM.  For bulk data
applications such as file serving this trick works like a charm.  But
other applications do truly random access and never come back for the
prefetched blocks; in this case both disk bandwidth and cache space
are wasted.  An advanced implementation can try to distinguish
sequential from random access patterns, but in practice that appears
to be a challenging task.
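
To make the trade-off concrete, here is a minimal Python sketch of the
trivial scheme described above.  It is a toy model, not how ZFS
actually implements prefetch: the function name simulate(), the
read-ahead of 7 blocks, the 64-block LRU cache and the size of the
simulated disk are all assumptions picked for illustration.

```python
import random
from collections import OrderedDict


def simulate(accesses, readahead, cache_blocks):
    """Toy model of a naive read-ahead cache (hypothetical, not ZFS).

    Returns (blocks_read_from_disk, cache_hits) for a sequence of
    single-block read requests.
    """
    cache = OrderedDict()  # block number -> None, kept in LRU order
    disk_blocks = 0
    hits = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)  # mark as most recently used
            hits += 1
            continue
        # Miss: read the requested block plus `readahead` blocks after it.
        for b in range(block, block + 1 + readahead):
            if b not in cache:
                disk_blocks += 1
            cache[b] = None
            cache.move_to_end(b)
        # Evict least recently used blocks once the cache is over capacity.
        while len(cache) > cache_blocks:
            cache.popitem(last=False)
    return disk_blocks, hits


if __name__ == "__main__":
    rng = random.Random(0)
    patterns = {
        "sequential": list(range(1000)),
        "random": [rng.randrange(10**6) for _ in range(1000)],
    }
    for name, accesses in patterns.items():
        for readahead in (0, 7):
            disk, hits = simulate(accesses, readahead, cache_blocks=64)
            print(f"{name:10s} readahead={readahead} "
                  f"disk_blocks={disk} hits={hits}")
```

In this model the sequential workload reads the same number of blocks
from disk with or without read-ahead, but with read-ahead most
requests are served from cache.  The random workload gets almost no
extra hits and reads several times more blocks from disk, which is the
wasted bandwidth and cache space mentioned above.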

> ZFS uses quite a bit of memory, so if you distributed all your
> memory to be used by just postgres and disk cache then you didn't
> leave enough space for the prefetch data and _something_ will be
> moved to swap.

I hope you know that FreeBSD is exceptionally good at distributing
available memory between its consumers.  That said, useless prefetch
indeed puts extra pressure on disk cache and results in unnecessary
cache evictions, thus making things even worse.  It is true that ZFS
is memory hungry and so rather sensitive to non-optimal memory use
patterns.  Useless prefetch wastes memory that could be used to speed
up other ZFS operations.

> If you're running FreeBSD i386 then ZFS requires some careful tuning
> due to the limits a 32-bit OS puts on memory. I recall ZFS not being
> very stable on i386 a while ago for those reasons, which has by now
> been fixed as far as possible, but it's not ideal (and it likely
> never will be).

I use FreeBSD/amd64 and I'm generally happy with ZFS on that platform.

> You'll probably want to ask about this on the FreeBSD mailing lists
> as well, they'll know much better than I do ;)

Are you a local FreeBSD expert? ;-)  Joking aside, I don't think this
topic has to do with FreeBSD as such; it is mostly about making the
advanced technologies of PostgreSQL and ZFS work well together.  Even
ZFS developers admit that database-related applications may call for
exceptions to the general ZFS practices and rules.

When I set up my next ZFS-based PostgreSQL server, I think I'll play
with the recordsize property of ZFS and see if setting it to PAGESIZE
makes any difference.

Thanks,

Yar
