BUG #4575: All page cache in shared_buffers pinned (duplicated by OS, always) - Mailing list pgsql-bugs
From: Scott Carey
Subject: BUG #4575: All page cache in shared_buffers pinned (duplicated by OS, always)
Date:
Msg-id: 200812110007.mBB07fHF042546@wwwmaster.postgresql.org
Responses:
  Re: BUG #4575: All page cache in shared_buffers pinned (duplicated by OS, always)
  Re: BUG #4575: All page cache in shared_buffers pinned (duplicated by OS, always)
List: pgsql-bugs
The following bug has been logged online:

Bug reference:      4575
Logged by:          Scott Carey
Email address:      scott@richrelevance.com
PostgreSQL version: 8.3.5, 8.3.4
Operating system:   Linux (CentOS 5.2 2.6.18-92.1.10.el5)
Description:        All page cache in shared_buffers pinned (duplicated by OS, always)
Details:

I have determined that nearly every cached page within shared_buffers is being pinned in memory, preventing the OS from dropping any such pages from its page cache. Effectively, every file page cached in shared_buffers costs twice its size in RAM: the duplicate copy in the OS page cache will not page out because it is pinned.

-----------
To Reproduce:
-----------

Stop Postgres, then drop the OS page caches as follows:

# sync; echo 3 > /proc/sys/vm/drop_caches

Using 'free -m' or 'top', note the amount of memory that remains in the disk cache after forcing it to clear. This number is the baseline un-evictable page cache size.

Configure postgres with shared_buffers at ~30% of total RAM (the purpose of this is to make the effect obvious) and start Postgres. Repeat the cache-drop test above; the value will be only slightly larger than before.

Run 'top' and note the size of the "SHR" column for the postgres process with the largest value of "SHR". This is the shared memory postgres is currently using.

Execute a query that fills the page cache. This is any insert (create table as select * from other_table works), or any select that does not cause a sequential scan. It does not appear that sequential scans ever end up putting pages in shared_buffers (nor does vacuum, but I knew about vacuum's ring buffer before this). I used a query on a large table (5GB) with an index, known to do an index scan, and ran repeated variants on the where clause to hit most of it. (A consolidated script sketch of these steps is given after the Context section below.)

Run top again and note the largest value of the "SHR" column across all postgres processes.

Now execute the OS cache eviction and check the remaining cached memory. It is now larger than the baseline by essentially the exact size of the postgres shared memory. Note that you can wait for a checkpoint and sync, so that there are no dirty pages from the OS point of view, and the other baselines rule out executable pages. Yet all the pages that correspond to those in shared_buffers remain pinned in memory.

Running such tests over and over, with different data and different values for shared_buffers, the result is consistent and easily reproducible on CentOS 5.2 Linux as described above.

---------
Context:
---------

This is a rather large performance and scale issue. In some cases a large shared_buffers helps, because indexes and randomly accessed data tend to stay cached there well (sequential scans don't kick things out of it). But if there is memory pressure on the system, the OS can't evict the pinned pages, effectively making this cache cost twice the space. Additionally, heavy write workloads need a large shared_buffers, along with background writer and checkpoint tuning, to perform at peak capability (typically it helps to be around 5 seconds * the MB/sec the I/O is capable of -- 6GB for me).

As I previously understood postgres, pages cached in shared_buffers should be able to differ from those the OS caches. So if a heavily accessed index is always in shared_buffers, it will not be in the OS cache. When its pages are written they pass through the OS cache, but since no reads reach the OS, those pages are evicted relatively quickly from the OS page cache and highly conserved in shared_buffers. This bug prevents that behavior.
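Consolidating the reproduction above, a rough script sketch looks like the following. The database "testdb", table "big_table", its "id" column, the range bounds, and the "service postgresql" init script name are placeholders for illustration, not the actual objects I used; adjust the WHERE ranges so the planner actually chooses index scans. Run as root.

#!/bin/bash
# NOTE: 'testdb' and 'big_table' below are placeholder names.
# 1. Stop Postgres and record the baseline un-evictable page cache size.
service postgresql stop
sync; echo 3 > /proc/sys/vm/drop_caches
echo "Baseline page cache:"; grep '^Cached:' /proc/meminfo

# 2. Start Postgres (shared_buffers already set to ~30% of RAM) and pull
#    most of a large indexed table through shared_buffers via index scans.
service postgresql start
for lo in $(seq 0 1000000 9000000); do
  psql -d testdb -c "SELECT count(*) FROM big_table WHERE id BETWEEN $lo AND $((lo + 1000000));"
done

# 3. Note the largest SHR value among postgres processes, then try to
#    evict the page cache again and compare with the baseline.
top -b -n 1 | grep postgres
sync; echo 3 > /proc/sys/vm/drop_caches
echo "Page cache with Postgres running:"; grep '^Cached:' /proc/meminfo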
There is one other behavior I have seen: kswapd CPU use is much higher when there are a large number of pinned pages in the system. kswapd will use significant CPU during disk reads (10% of a 3 GHz CPU for every 250 MB/sec read) if there is ~8GB of pinned data, and the more data pinned, the higher the CPU used. Without postgres running and pinning memory, kswapd's CPU time during the same activity is significantly lower.

This bug was first noticed because of how the Linux kernel behaves when it wants to evict items from the page cache but they are pinned. With shared_buffers at 25% of RAM, the system was unable to allocate more than 50% of the remaining 75% to processes. As that limit was approached, system CPU% climbed higher and higher, until all 8 cores spun at 100% when the remaining page cache was roughly equal in size to shared_buffers. It takes a low 'swappiness' value to expose this before a swap storm occurs. Although I have not tried it, it would not surprise me if configuring shared_buffers to 55% of RAM with low swappiness, then filling shared_buffers with data from reads, makes the system fall down. Default swappiness would likely swap storm instead, but may behave similarly badly, considering that close to half of system memory should still be freely available. Given the various articles and blog posts out there where some have configured shared_buffers to well over 50% of memory (90% even, on Solaris), this may be particular to recent versions of postgres, or of Linux.
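For anyone trying to confirm the kswapd symptom while the read-heavy queries run, a few standard Linux tools are enough; the 5-second interval below is arbitrary. PostgreSQL 8.3 keeps shared_buffers in a SysV shared memory segment, so ipcs reports its size directly.

cat /proc/sys/vm/swappiness           # a low value exposes the spinning before a swap storm
ipcs -m                               # the large segment is shared_buffers
top -b -d 5 -p "$(pgrep -d, kswapd)"  # CPU use of the kswapd thread(s) over time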