Thread: postgres files in use not staying in linux file cache
Hi, I'm trying to investigate a performance problem.
We have a large database (over 1TB) running on a server with 160GB of RAM and 32 cores (Xeon E5-2650). The database files are on a NetApp mount.

I have noticed a few times that an index scan may be taking a long time, and the query's backend process is reading from disk at about 2 MB/s, spending 99% of its time waiting for I/O (using iotop). This makes sense, if scanning an index that is not in cache.

Some research about the Linux page cache suggests that any file can be essentially forced into cache by cat-ting it to /dev/null. So I cat the file of one of these database objects, and I can see the cat process reading at about 100MB/s, so it takes 10 sec for a 1GB file. Then I check fincore again -- the file is still not cached. cat-ting the file again still takes 10 sec.
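(For the record, the check is basically this; the path is just a placeholder for one of the relation files under the data directory, and fincore here is the linux-ftools build, so the exact binary name and flags may differ on your system:)

  # read the whole file, which should pull its pages into the page cache
  cat $PGDATA/base/16384/12345 > /dev/null

  # then ask how much of the file is resident in the page cache
  fincore $PGDATA/base/16384/12345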
I cat several times in a row, and the file refuses to cache. Sometimes I see a bit of the file appear in the cache, but get removed a few seconds later. I also tried the fadvise program in the ftools, which didn't help. The I/O on this machine is not all that high (a few MB/s for various postgres processes), and there are a few GB free (shown in top). Most of the RAM (150GB+) is used by the page cache. No swap is in use.
Eventually the query finishes (or I cancel it). Then I find that running cat on the file does leave it in cache! So, is having multiple readers confusing Linux, or is Postgres doing any madvise on the file? (I'd expect the answer to be no to both.)
So here's where I'm stuck. How can reading a file not leave it in the Linux cache? I'd expect it to enter the inactive list (which is about 80GB), so I'd expect another 80GB would need to be read before it would be its turn to be evicted.... which should take a long time if my maximum read speed is 100MB/s.
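(For reference, the active/inactive list sizes can be read straight from /proc/meminfo, e.g.:)

  grep -E '^(Active|Inactive)' /proc/meminfo
  # Active(file)/Inactive(file) are the page-cache LRU lists;
  # Active(anon)/Inactive(anon) are anonymous (process) memory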
Since I don't understand the behaviour of simply running cat and the file not being cached, I wonder if this is an issue with Linux, not Postgres. But the issue seems to happen with files in use by Postgres, so maybe there's an interaction; that's why I thought I'd start here.
Any ideas how I can debug this further? Thanks!
Brian
On 06/05/2014 04:32 PM, Brio wrote:
> So here's where I'm stuck. How can reading a file not leave it in the
> Linux cache? I'd expect it to enter the inactive list (which is about
> 80GB), so I'd expect another 80GB would need to be read before it would
> be its turn to be evicted.... which should take a long time if my
> maximum read speed is 100MB/s.

So here's the thing. The Linux page reclamation code is *extremely broken* in everything before 3.11. Take a look at this, then realize that this is *only one patch* from several that target the memory manager weightings:

http://linux-kernel.2935.n7.nabble.com/patch-v2-0-3-mm-improve-page-aging-fairness-between-zones-nodes-td696105.html

This is especially true of the 3.2 kernel you're using. It's extremely aggressive about ageing pages out of memory when there's high memory pressure from frequent disk reads. The chances of promotion into the active set are dismal, so you end up with a constant churn between inactive and disk. Worse, if you kick it hard enough by having too many PostgreSQL backends using memory, it'll actively purge the active set while still failing to promote the inactive set.

The dev that linked me to this patch said he tested it against 3.10, meaning it probably went into 3.11 or 3.12. So I personally wouldn't trust anything before 3.13. :p

Since you're using Ubuntu 12.04, I strongly suggest upgrading your core to 12.04.4 and applying the linux-generic-lts-saucy pseudo-package to at least get onto 3.11. The 3.2 kernel is pants-on-head retarded; we've had a lot more luck with 3.8 and above.

Cheers!

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
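(For what it's worth, on a stock 12.04 server that upgrade path looks roughly like this; treat it as a sketch, since the exact steps may vary with your environment:)

  sudo apt-get update
  sudo apt-get install linux-generic-lts-saucy   # the HWE kernel pseudo-package mentioned above
  sudo reboot
  uname -r    # should now report a 3.11.x kernel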
> So here's the thing. The Linux page reclamation code is *extremely broken* in everything before 3.11. Take a look at this, then realize that this is *only one patch* from several that target the memory manager weightings:
> <snipped>
> Since you're using Ubuntu 12.04, I strongly suggest upgrading your core to 12.04.4 and apply the linux-generic-lts-saucy pseudo-package to at least get onto the 3.11 instead. The 3.2 kernel is pants-on-head retarded; we've had a lot more luck with 3.8 and above.
Without trying to hijack this thread, is there any documentation around recommended or known-good kernels for Postgres/DB type workloads?
I can see there has been a lot of discussion around recent kernel developments being detrimental (or potentially detrimental) - http://www.postgresql.org/message-id/20140115141408.GJ4963@suse.de - but there doesn’t seem to be a definitive resource to know which kernels to avoid.
I ask because I’ve just read your statement above about 3.2 being pants-on-head, and having had more luck with 3.8 and above – despite most installations being on much older (2.6.19) kernels (as per the thread).
I’d be interested to see how much benefit we’d get from moving off 3.2, but I’d like to be aware of the risks of more recent kernel versions also.
Cheers,
Tim
On 06/13/2014 02:19 AM, Tim Kane wrote:
> I ask because I’ve just read your statement above about 3.2 being
> pants-on-head, and having had more luck with 3.8 and above – despite
> most installations being on much older (2.6.19) kernels (as per the
> thread).

Well, the issue is that the 3.2 kernel was a huge departure from the 2.6 tree that most people are still using. Thanks to RHEL, CentOS and their ilk, the 2.6 tree has had a much longer lifetime than it probably should have. As a result, the newer kernels haven't had sufficient real-world server testing.

With 3.2 being the first of those, it was a bit wonky, to be honest. The new CPU scheduler didn't have enough knobs, and the knobs that *were* there were set more appropriately for desktop use. The memory manager was a train wreck and has been patched numerous times with rather sweeping changes since. For a while, there was even a bug in how system load was calculated that caused it to be off by an order of magnitude or more based on process switching activity.

The overall situation has improved significantly, as has the development momentum. It's extremely difficult to say which kernel versions have more risk than others, since no kernel seems to be around for more than a month or two before the next one comes out. My opinion has been to get on the latest stable kernel for the distribution, and ignore everything else. For Ubuntu 12.04.4 LTS, that's 3.11. Our systems are definitely much happier since the upgrade, but the plural of anecdote is not data. :)

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com
On Thu, Jun 5, 2014 at 2:32 PM, Brio <brianoraas@gmail.com> wrote:
> Hi, I'm trying to investigate a performance problem.
>
> We have a large database (over 1TB) running on a server with 160GB of RAM
> and 32 cores (Xeon E5-2650). The database files are on a NetApp mount.
> ...
>
> I have noticed a few times that an index scan may be taking a long time, and
> the query's backend process is reading from disk at about 2 MB/s, spending
> 99% of its time waiting for I/O (using iotop). This makes sense, if scanning
> an index that is not in cache.

Does the index scan dirty most of the index blocks it touches?

(When an index scan follows an index entry to a heap page and finds that the tuple is no longer needed, when it gets back to the index it might kill that entry, so that the next index scan doesn't need to do the futile heap look-up. This dirties the index block, even for a "read only" scan. However, it would be unusual for a typical index scan to do this for most of the blocks it touches. It could happen if the index scan is to support a giant, rarely run reporting query, for example, or if your vacuuming schedule is not tuned correctly.)

The reason I ask is that I have previously seen the dirty blocks of NetApp-served files get dropped from the Linux page cache as soon as they are written back to the NetApp. I had written a little Perl script to cut PostgreSQL out of the loop entirely to demonstrate this effect, but I no longer have access to it.

Cheers,

Jeff
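(Not Jeff's Perl script -- just a rough shell sketch of the same kind of test, with a made-up mount point, and fincore again being whatever fincore build is installed:)

  cd /mnt/netapp_vol                              # placeholder for the NFS-mounted volume
  dd if=/dev/zero of=cache_test bs=1M count=256   # dirty 256MB worth of pages
  fincore cache_test                              # usually shows the freshly written file fully cached
  sync                                            # force writeback to the NFS server
  sleep 5
  fincore cache_test                              # if dirty pages are dropped on writeback, little or nothing is cached now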
(Sorry, our last exchange forgot to cc the pgsql-performance list.)
Yes, I did see the original problem only when postgres was also accessing the file. But the issue is intermittent, so I can't reproduce on demand, so I'm only reporting what I saw a small number of times, and not necessarily (or likely) the whole story.

This machine has 64 GB of RAM. There was about 20 GB free, and the rest was mostly file cache, mostly our large 1TB database. I ran a script that did various reading and writing to the database, but mostly updated many rows over and over again to new updated values. As this script ran, the cached memory slowly dropped, and free memory increased. I now have 43 GB free!

I'd expect practically any activity to leave files in the cache, and no significant evictions to occur until memory runs low. What actually happens is the cache increases gradually, and then drops down in chunks. I would think that the only file activity that would evict from cache would be deleting files, which would only happen when dropping tables (not happening in my test script), and also WAL file cycling, which should stay a constant amount of memory.
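(The cached/free numbers above are the sort of thing you can watch with a loop like this while the script runs; the exact fields of interest may vary:)

  # print free and cached memory once a minute
  while true; do
      date
      grep -E '^(MemFree|Cached)' /proc/meminfo
      sleep 60
  done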
But, if blocks that are written are evicted from the cache, that would explain it, so I'd like to test that. As a very basic test, I tried:
cd /path-to-nfs-mount
echo "foo" > foo.txt

On Mon, Jun 23, 2014 at 3:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Jun 18, 2014 at 11:18 PM, Brio <brianoraas@gmail.com> wrote:
> Hi Jeff,
>
> That is interesting -- I hadn't thought about how a read-only index scan
> might actually write the index.
>
> But, to avoid effects like that, that's why I dropped down to simply using
> "cat" on the file, and I saw the same problem there, with no writing back.

I thought that you saw the same problem with cat only when it was
running concurrently with the index scan, and when the index scan
stopped the problem in cat went away.

> So the problem really seemed to be in Linux, not Postgres.
>
> But why would dirty blocks of NetApp-served files get dropped from the Linux
> page cache as soon as they are written back to the NetApp? Is it a bug in
> the NetApp driver? Isn't the driver just NFS?

I don't know why it would do that; it never made much sense to me.
But that is what the experimental evidence indicated.

What I was using was NetApp on the back end and just the plain Linux
NFS driver on the client end, and I assume the problem was on the
client end. (Maybe you can get a custom client driver from NetApp
designed to work specifically with their server, but if so, I didn't
do that. For that matter, maybe just the default Linux NFS driver has
improved.)

> That sounds like a serious
> issue. Is there any online documentation of bugs like that with NetApp?

Yes, it was a serious issue for one intended use. But it was
partially mitigated by the fact that I would probably never run an
important production database over NFS anyway, out of corruption
concerns. I was hoping to use it just for testing purposes, but this
limit made it rather useless for that as well. I don't think it would
be a NetApp-specific issue, and I didn't approach it from that angle;
it's just that NetApp didn't save me from the issue.
Cheers,
Jeff