Thread: PSA: XFS and Linux Cache Poisoning

PSA: XFS and Linux Cache Poisoning

From: Shaun Thomas

Hey everyone,

We recently got bit by this, and I wanted to make sure it was known to
the general community.

In new(er) Linux kernels, including late versions of the 2.6 tree, XFS
has introduced dynamic speculative preallocation. What does this do? It
was added to prevent filesystem fragmentation by preallocating a large
chunk of space to files so extensions to those files can go in the same
allocation. The "dynamic" part just means it adjusts the size of this
preallocation based on internal heuristics.

Unfortunately, the logic for tracking this extra space also changed. At
least in previous kernels, this space would eventually be deallocated.
Now, it survives as long as there are any in-memory references to a
file, such as in a busy PG database. The filesystem itself sees this
space as "used", and tools such as df or du report it as such.

How do you check if this is affecting you?

du -sm --apparent-size /your/pg/dir; du -sm /your/pg/dir

If you're using XFS, and there is a large difference in these numbers,
you've been bitten by the speculative preallocation system.
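
To see which individual files carry the extra space, the same comparison
can be made per file. What follows is only a rough sketch, assuming GNU
find, stat and awk are available; the function name prealloc_report and
the 16 MB default threshold are made up for illustration.

```shell
# prealloc_report DIR [THRESHOLD_MB]
# Hypothetical helper: list files under DIR whose allocated size exceeds
# their apparent size by more than THRESHOLD_MB (default 16, an arbitrary
# choice). Relies on GNU stat: %s = apparent size in bytes, %b = number of
# allocated blocks, %B = size in bytes of each such block, %n = file name.
prealloc_report() {
    dir=$1
    threshold_mb=${2:-16}
    find "$dir" -type f -exec stat -c '%s %b %B %n' {} + |
    awk -v limit="$threshold_mb" '
        {
            apparent  = $1
            allocated = $2 * $3
            extra_mb  = (allocated - apparent) / 1048576
            if (extra_mb > limit + 0) {
                # Strip the three numeric fields to recover the file name.
                name = $0
                sub(/^[0-9]+ [0-9]+ [0-9]+ /, "", name)
                printf "%.1f MB extra: %s\n", extra_mb, name
            }
        }'
}
```

Something like "prealloc_report /your/pg/dir 64" would then list only the
files carrying more than 64 MB beyond their apparent size.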

But where does it go while allocated? Why, to your OS cache, of course.
Systems with several GB of RAM may experience extreme phantom database
"bloat" because of the dynamic aspect of the preallocation system. So
there are actually two problems:

1. Data files are reported as larger than their actual size and have
extra space kept around "just in case". Since PG has a maximum file size
of 1GB, this is basically pointless.
2. Blocks that could be used for inode caching to improve query
performance are reserved instead for caching empty segments for XFS.

The first can theoretically exhaust the free space on a file system. We
were seeing 45GB(!) of bloat on one of our databases caused directly by
this. The second, due to the new and improved PG planner, can result in
terrible query performance and high system load since the OS cache does
not match the planner's assumptions.

So how is this fixed? Luckily, the dynamic allocator can be disabled by
choosing a fixed allocation size. Add "allocsize=<size>" to your mount
options. We used a size of 1m (for 1 megabyte) to retain some of the
defragmentation benefits, while still blocking the dynamic allocator.
Valid sizes are powers of two from the page size (typically 4k) up to
1g, so some experimentation is probably warranted.
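
As an illustration, a fixed size could be set in /etc/fstab and then
confirmed from /proc/mounts. This is only a sketch: the device /dev/sdb1,
the mount point /srv/pg and the helper name show_allocsize are
placeholders, and it assumes the kernel lists allocsize among the options
shown in /proc/mounts.

```shell
# Example /etc/fstab entry (device and mount point are placeholders):
#   /dev/sdb1  /srv/pg  xfs  noatime,allocsize=1m  0 0

# show_allocsize MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: print the allocsize option of an XFS mount point,
# reading lines in /proc/mounts format (device, mount point, type,
# options, ...). Exits with status 1 if no allocsize option is found.
show_allocsize() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "xfs" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^allocsize=/) { print opts[i]; found = 1 }
        }
        END { exit !found }' "$mounts_file"
}
```

If "show_allocsize /srv/pg" prints nothing and fails, the filesystem is
presumably still using the dynamic allocator.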

This mount option *is not compatible* with the "remount" mount option,
so you'll need to completely shut everything down and unmount the
filesystem to apply.
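
Since "remount" will not pick up the option, the sequence is a full
stop, unmount and mount. A cautious sketch of that sequence follows. By
default it only prints the commands, and the function name, device and
mount point are placeholders to adapt, not a tested procedure.

```shell
# apply_allocsize DEVICE MOUNTPOINT [SIZE]
# Hypothetical helper. Prints the commands unless DRY_RUN=0 is set.
# Stop PostgreSQL and anything else using the filesystem first.
apply_allocsize() {
    device=$1
    mountpoint=$2
    size=${3:-1m}
    if [ "${DRY_RUN:-1}" = 0 ]; then run=; else run=echo; fi
    $run umount "$mountpoint" &&
    $run mount -t xfs -o "allocsize=$size" "$device" "$mountpoint"
}
```

For example, "apply_allocsize /dev/sdb1 /srv/pg 1m" prints the umount and
mount commands for review; setting DRY_RUN=0 runs them for real.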

We spent days trying to track down the reason our systems were reporting
a load of 20-30 after a recent OS upgrade. I figured it was only fair to
share this to save others the same effort.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: PSA: XFS and Linux Cache Poisoning

From: Ben Chobot

On Nov 12, 2012, at 7:37 AM, Shaun Thomas wrote:

> Hey everyone,
>
> We recently got bit by this, and I wanted to make sure it was known to the general community.


Oh hey, I've been wondering for a while why our master dbs seem to be using so much more space than their slaves. This
appears to be the reason. Thanks for the work in tracking it down!