Thread: postgres files in use not staying in linux file cache

From
Brio
Date:
Hi, I'm trying to investigate a performance problem.

We have a large database (over 1TB) running on a server with 160GB of RAM and 32 cores (Xeon E5-2650). The database files are on a NetApp mount.

The software is Postgres 9.3.1 on Ubuntu 12.04, Linux 3.2.0-38-generic.

Generally, when a query is slow, it's because it's waiting for I/O. Since only about 10% of the database can be in RAM at any time, this is expected. I'm trying to analyze the working set in the cache to see that relevant tables and indexes are cached. I can map our database's objects to files using pg_class.relfilenode to get the name(s), and then I use fincore from linux-ftools to inspect the Linux cache.
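For concreteness, here's roughly how I'm mapping a relation to its files (a small Python sketch; it assumes the default tablespace, so paths look like base/&lt;database oid&gt;/&lt;relfilenode&gt;, with .1, .2, ... suffixes for each additional 1GB segment -- the function name is just mine):

```python
import os

def relation_files(data_dir, db_oid, relfilenode):
    """Yield the on-disk segment files for a relation in the default
    tablespace: base/<db_oid>/<relfilenode>, then <relfilenode>.1,
    <relfilenode>.2, ... for each additional 1GB segment."""
    base = os.path.join(data_dir, "base", str(db_oid), str(relfilenode))
    if os.path.exists(base):
        yield base
    seg = 1
    while os.path.exists("%s.%d" % (base, seg)):
        yield "%s.%d" % (base, seg)
        seg += 1
```

I get the relfilenode from pg_class and the database oid from pg_database, then run fincore over each file this yields.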

I have noticed a few times that an index scan may be taking a long time, and the query's backend process is reading from disk at about 2 MB/s, spending 99% of its time waiting for I/O (using iotop). This makes sense, if scanning an index that is not in cache.

As this is happening, I expect bits of the index and table to be pulled into cache, so the index scan may speed up as it goes, or will at least finish with some portion of the index in cache, so the scan won't be so slow next time. I use fincore to see how much of the index and table are in cache (taking into account that large objects >1GB will be split into multiple files). To my surprise, the files are cached 0%!

Some research about the Linux page cache suggests that any file can be essentially forced into cache by cat-ting to /dev/null. So I cat a file of one of these database objects, and I can see the cat process reading at about 100MB/s, so it takes 10 sec for a 1GB file. Then I check fincore again -- the file is still not cached. cat-ting the file again still takes 10 sec.

I cat several times in a row, and the file refuses to cache. Sometimes I see a bit of the file appear in the cache, but get removed a few seconds later. I also tried the fadvise program in the ftools, which didn't help. The I/O on this machine is not all that high (a few MB/s for various postgres processes), and there are a few GB free (shown in top). Most of the RAM (150GB+) is used by the page cache. No swap is in use.
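For reference, I believe the hint that the fadvise tool issues can be reproduced directly from Python (POSIX_FADV_WILLNEED is advisory only, so the kernel is free to ignore it -- which, for all I know, is exactly what's happening here):

```python
import os

def prefetch(path):
    """Ask the kernel to read the whole file into the page cache.
    This is only a hint; the kernel may ignore it under memory pressure."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```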

Eventually the query finishes (or I cancel it). Then I find that running cat on the file does leave it in cache! So, is having multiple readers confusing Linux, or is Postgres doing some madvise on the file? (I'd expect the answer to both to be no.)

So here's where I'm stuck. How can reading a file not leave it in the Linux cache? I'd expect it to enter the inactive list (which is about 80GB), so I'd expect another 80GB would need to be read before it would be its turn to be evicted.... which should take a long time if my maximum read speed is 100MB/s.

Since I don't understand the behaviour of simply running cat and the file not being cached, I wonder if this is an issue with Linux, not Postgres. But the issue seems to happen with files in use by Postgres, so maybe there's an interaction, so I thought I'd start here.

Any ideas how I can debug this further? Thanks!

Brian

Re: postgres files in use not staying in linux file cache

From
Shaun Thomas
Date:
On 06/05/2014 04:32 PM, Brio wrote:

> So here's where I'm stuck. How can reading a file not leave it in the
> Linux cache? I'd expect it to enter the inactive list (which is about
> 80GB), so I'd expect another 80GB would need to be read before it would
> be its turn to be evicted.... which should take a long time if my
> maximum read speed is 100MB/s.

So here's the thing. The Linux page reclamation code is *extremely
broken* in everything before 3.11. Take a look at this, then realize
that this is *only one patch* from several that target the memory
manager weightings:

http://linux-kernel.2935.n7.nabble.com/patch-v2-0-3-mm-improve-page-aging-fairness-between-zones-nodes-td696105.html

This is especially true of the 3.2 kernel you're using. It's extremely
aggressive about ageing pages out of memory when there's high memory
pressure from frequent disk reads. The chances of promotion into the active
set are dismal, so you end up with a constant churn between inactive and
disk. Worse, if you kick it hard enough by having too many PostgreSQL
backends using memory, it'll actively purge the active set while still
failing to promote the inactive set.
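If it helps to picture it, here's a crude toy model (emphatically *not* the
kernel's real code) of the two-list scheme: pages land on the inactive list
on first touch, get promoted to active only on a second touch, and eviction
under pressure comes from the inactive tail. A single sequential read of a
file bigger than the cache never promotes anything:

```python
from collections import OrderedDict

class TwoListCache:
    """Toy model of inactive/active page lists (not the real kernel code).
    First access puts a page on the inactive list; a second access while
    it's still there promotes it to the active list. When the cache is
    full, the oldest inactive page is evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True          # promoted on second touch
        else:
            self._evict_if_full()
            self.inactive[page] = True        # first touch: inactive only

    def _evict_if_full(self):
        while len(self.inactive) + len(self.active) >= self.capacity:
            if self.inactive:
                self.inactive.popitem(last=False)
            else:
                self.active.popitem(last=False)

# One sequential pass over a file ten times bigger than the cache,
# i.e. "cat big_file > /dev/null":
cache = TwoListCache(capacity=100)
for page in range(1000):
    cache.touch(page)
# Nothing was touched twice, so nothing was promoted; only the tail
# of the file survives on the inactive list.
```

Note that even a *second* cat of the same file promotes nothing in this model: by the time the scan reaches the surviving tail pages, the head of the file has already evicted them.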

The dev that linked me to this patch said he tested it against 3.10,
meaning it probably went into 3.11 or 3.12. So I personally wouldn't
trust anything before 3.13. :p

Since you're using Ubuntu 12.04, I strongly suggest upgrading your core
to 12.04.4 and applying the linux-generic-lts-saucy pseudo-package to at
least get onto the 3.11 kernel instead. The 3.2 kernel is pants-on-head
retarded; we've had a lot more luck with 3.8 and above.

Cheers!

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: postgres files in use not staying in linux file cache

From
Tim Kane
Date:


From: Shaun Thomas <sthomas@optionshouse.com>
Date: Tuesday, 10 June 2014 22:07
So here's the thing. The Linux page reclamation code is *extremely
broken* in everything before 3.11. Take a look at this, then realize
that this is *only one patch* from several that target the memory
manager weightings:

<snipped>

Since you're using Ubuntu 12.04, I strongly suggest upgrading your core
to 12.04.4 and applying the linux-generic-lts-saucy pseudo-package to at
least get onto the 3.11 kernel instead. The 3.2 kernel is pants-on-head
retarded; we've had a lot more luck with 3.8 and above.


Without trying to hijack this thread, is there any documentation around recommended or known-good kernels for Postgres/DB type workloads?
I can see there has been a lot of discussion around recent kernel developments being detrimental (or potentially detrimental) -- http://www.postgresql.org/message-id/20140115141408.GJ4963@suse.de -- but there doesn't seem to be a definitive resource for knowing which kernels to avoid.

I ask because I've just read your statement above about 3.2 being pants-on-head, and about having had more luck with 3.8 and above -- despite most installations being on much older (2.6.19) kernels (as per the thread).
I'd be interested to see how much benefit we'd get from moving off 3.2, but I'd also like to be aware of the risks of more recent kernel versions.

Cheers,

Tim



Re: postgres files in use not staying in linux file cache

From
Shaun Thomas
Date:
On 06/13/2014 02:19 AM, Tim Kane wrote:

> I ask because I've just read your statement above about 3.2 being
> pants-on-head, and about having had more luck with 3.8 and above --
> despite most installations being on much older (2.6.19) kernels (as
> per the thread).

Well, the issue is that the 3.2 kernel was a huge departure from the 2.6
tree that most people are still using. Thanks to RHEL, CentOS and their
ilk, the 2.6 tree has had a much longer lifetime than it probably should
have. As a result, the newer kernels haven't had sufficient real-world
server testing.

With 3.2 being the first of those, it was a bit wonky, to be honest. The
new CPU scheduler didn't have enough knobs, and the knobs that *were*
there, were set more appropriately for desktop use. The memory manager
was a train wreck and has been patched numerous times with rather
sweeping changes since. For a while, there was even a bug in how system
load was calculated that caused it to be off by an order of magnitude or
more, depending on process-switching activity.

The overall situation has improved significantly, as has the development
momentum. It's extremely difficult to say which kernel versions have
more risk than others, since no kernel seems to be around for more than
a month or two before the next one comes out. My opinion has been to get
on the latest stable kernel for the distribution, and ignore everything
else.

For Ubuntu 12.04.4 LTS, that's 3.11.

Our systems are definitely much happier since the upgrade, but the
plural of anecdote is not data. :)

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: postgres files in use not staying in linux file cache

From
Jeff Janes
Date:
On Thu, Jun 5, 2014 at 2:32 PM, Brio <brianoraas@gmail.com> wrote:
> Hi, I'm trying to investigate a performance problem.
>
> We have a large database (over 1TB) running on a server with 160GB of RAM
> and 32 cores (Xeon E5-2650). The database files are on a NetApp mount.
>
...
>
> I have noticed a few times that an index scan may be taking a long time, and
> the query's backend process is reading from disk at about 2 MB/s, spending
> 99% of its time waiting for I/O (using iotop). This makes sense, if scanning
> an index that is not in cache.

Does the index scan dirty most of the index blocks it touches?  (When
an index scan follows an index entry to a heap page and finds that the
tuple is no longer needed, when it gets back to the index it might
kill that entry, so that the next index scan doesn't need to do the
futile heap lookup.  This dirties the index block, even for a "read
only" scan.  However, it would be unusual for a typical index scan to
do this for most of the blocks it touches.  It could happen if the
index scan is to support a giant rarely run reporting query, for
example, or if your vacuuming schedule is not tuned correctly.)

The reason I ask that is that I have previously seen the dirty blocks
of NetApp-served files get dropped from the linux page cache as soon
as they are written back to the NetApp.

I had written a little Perl script to cut postgresql out of the loop
entirely to demonstrate this effect, but I no longer have access to
it.

Cheers,

Jeff


Re: postgres files in use not staying in linux file cache

From
Brio
Date:
(Sorry, our last exchange forgot to cc the pgsql-performance list.)

Yes, I did see the original problem only when postgres was also accessing the file. But the issue is intermittent, so I can't reproduce it on demand; I'm only reporting what I saw a small number of times, and not necessarily (or even likely) the whole story.

I've upgraded the kernel on my test machine, and I haven't seen the original problem. But I am seeing what looks like it might be the problem you describe, Jeff. Here's what I saw:

This machine has 64 GB of RAM. There was about 20 GB free, and the rest was mostly file cache, mostly our large 1TB database. I ran a script that did various reading and writing to the database, but mostly updated many rows over and over with new values. As this script ran, the cached memory slowly dropped and free memory increased -- I now have 43 GB free!

I'd expect practically any activity to leave files in the cache, with no significant evictions until memory runs low. What actually happens is that the cache increases gradually, then drops down in chunks. I would think the only file activity that evicts from cache would be deleting files, which here would only happen when dropping tables (my test script doesn't) or when cycling WAL files (which should stay a constant amount of memory).

But, if blocks that are written are evicted from the cache, that would explain it, so I'd like to test that. As a very basic test, I tried:
cd /path-to-nfs-mount
echo "foo" > foo.txt
sync   # force writeback of dirty pages (I haven't really used this before)
linux-fincore foo.txt
which shows the file is cached 100%.

Although you don't have the Perl script you mentioned, could you give a basic description of what it does, so I could try to recreate it? I'm not familiar with Perl, but I've done plenty of C programming, so demonstrating this with the actual Linux APIs would be ideal.
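In the meantime, here's my rough attempt at such a test in Python, using the mincore(2) syscall (which I believe is what fincore uses under the hood). The ctypes binding to glibc is an assumption on my part; treat this as a sketch, not a polished tool:

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def resident_pages(path):
    """Return (resident, total) page counts for a file, roughly what
    linux-fincore reports, via mincore(2) on an mmap of the file."""
    size = os.path.getsize(path)
    pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    if pages == 0:
        return 0, 0
    fd = os.open(path, os.O_RDWR)
    try:
        m = mmap.mmap(fd, size)
        buf = (ctypes.c_char * size).from_buffer(m)  # page-aligned start
        vec = (ctypes.c_ubyte * pages)()
        rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                          ctypes.c_size_t(size), vec)
        if rc != 0:
            raise OSError(ctypes.get_errno(), "mincore failed")
        resident = sum(b & 1 for b in vec)
        del buf            # release the buffer export so close() works
        m.close()
        return resident, pages
    finally:
        os.close(fd)

def write_and_check(path, nbytes):
    """Write nbytes and fsync (force writeback), then report residency."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"x" * nbytes)
        os.fsync(fd)       # writeback for this one file, like sync(1)
    finally:
        os.close(fd)
    return resident_pages(path)
```

Run against a file on the NFS mount, the idea would be to see whether the resident count drops to zero right after the fsync, which would match the behaviour you described.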

Thanks Jeff!



On Mon, Jun 23, 2014 at 3:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Jun 18, 2014 at 11:18 PM, Brio <brianoraas@gmail.com> wrote:
> Hi Jeff,
>
> That is interesting -- I hadn't thought about how a read-only index scan
> might actually write the index.
>
> But, to avoid effects like that, that's why I dropped down to simply using
> "cat" on the file, and I saw the same problem there, with no writing back.

I thought that you saw the same problem with cat only when it was
running concurrently with the index scan, and when the index scan
stopped the problem in cat went away.

> So the problem really seemed to be in Linux, not Postgres.
>
> But why would dirty blocks of NetApp-served files get dropped from the Linux
> page cache as soon as they are written back to the NetApp? Is it a bug in
> the NetApp driver? Isn't the driver just NFS?

I don't know why it would do that; it never made much sense to me.
But that is what the experimental evidence indicated.

What I was using was NetApp on the back-end and just the plain linux
NFS driver on the client end, and I assume the problem was on the
client end.  (Maybe you can get a custom client driver from Net-App
designed to work specifically with their server, but if so, I didn't
do that.  For that matter, maybe just the default linux NFS driver has
improved.)

> That sounds like a serious
> issue. Is there any online documentation of bugs like that with NetApp?

Yes, it was a serious issue for one intended use.  But it was
partially mitigated by the fact that I would probably never run an
important production database over NFS anyway, out of corruption
concerns.  I was hoping to use it just for testing purposes, but this
limit made it rather useless for that as well.  I don't think it would
be a NetApp-specific issue and didn't approach it from that angle;
it's just that NetApp didn't save us from the issue.

Cheers,

Jeff