RE: Protect syscache from bloating with negative cache entries - Mailing list pgsql-hackers
From | Tsunakawa, Takayuki
---|---
Subject | RE: Protect syscache from bloating with negative cache entries
Date |
Msg-id | 0A3221C70F24FB45833433255569204D1FB97565@G01JPEXMBYT05
In response to | Re: Protect syscache from bloating with negative cache entries (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List | pgsql-hackers
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> > I meant that the time-based eviction is not very good, because it could cause less frequently accessed entries to vanish even when memory is not short. Time-based eviction reminds me of Memcached, Redis, DNS, etc., which evict long-lived entries to avoid stale data, not to free space for other entries. I think size-based eviction is sufficient, like shared_buffers, OS page cache, CPU cache, disk cache, etc.
>
> Right. But the logic behind the time-based approach is that evicting such entries should not cause any issues exactly because they are accessed infrequently. It might incur some latency when we need them for the first time after the eviction, but IMHO that's acceptable (although I see Andres did not like that).

Yes, that's what I expressed. That is, I'm probably with Andres.

> FWIW we might even evict entries after some time passes since inserting them into the cache - that's what memcached et al do, IIRC. The logic is that frequently accessed entries will get immediately loaded back (thus keeping cache hit ratio high). But there are reasons why the other dbs do that - like not having any cache invalidation (unlike us).

These are what Memcached and Redis do:

1. Evict entries that have lived longer than their TTLs. This is independent of the cache size, and is done to avoid keeping stale data in the cache when the underlying data (such as in the database) is modified. This doesn't apply to PostgreSQL.

2. Evict the least recently accessed entries. This is done to make room for new entries when the cache is full, and is similar or identical to what PostgreSQL and other DBMSs do for their database cache. Oracle and MySQL also do this for their dictionary caches, where "dictionary cache" corresponds to syscache in PostgreSQL.

Here's my sketch for this feature. Although it may not meet all the (contradictory) requirements as you said, it's simple and familiar to those who have used PostgreSQL and other DBMSs. What do you think? The points are simplicity, familiarity, and memory consumption control for the DBA.

* Add a GUC parameter syscache_size which imposes an upper limit on the total size of all catcaches, not on each individual catcache. The naming follows effective_cache_size. It could be syscache_mem to follow work_mem and maintenance_work_mem. The default value is 0, which doesn't limit the cache size, as now.

* A new member variable in CatCacheHeader tracks the total size of all cached entries.

* A single new LRU list in CatCacheHeader links all cache tuples in LRU order. Each cache access (SearchCatCacheInternal()) puts the found entry at its front.

* Insertion of a new catcache entry adds the entry size to the total cache size. If the total size exceeds the limit defined by syscache_size, the least recently accessed entries are removed until the total cache size gets below the limit, as sketched after this list.

This eviction results in slight overhead when the cache is full, but the response time is steady. On the other hand, with the proposed approach, users will wonder about mysterious long response times due to bulk entry deletions.
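Just to make the last two points concrete, below is a minimal standalone sketch of the intended LRU bookkeeping and eviction. CacheHeader and CacheEntry are simplified stand-ins for CatCacheHeader and CatCTup, and size_limit stands in for the proposed syscache_size GUC; this is an illustration of the idea, not actual PostgreSQL code.

```c
/*
 * Minimal standalone sketch of the proposed LRU bookkeeping and
 * size-based eviction.  CacheHeader and CacheEntry are simplified
 * stand-ins for CatCacheHeader and CatCTup; size_limit stands in for
 * the proposed syscache_size GUC (0 = unlimited, as today).
 */
#include <stddef.h>
#include <stdlib.h>

typedef struct CacheEntry
{
    struct CacheEntry *prev;    /* toward the front (more recent) */
    struct CacheEntry *next;    /* toward the back (less recent) */
    size_t      size;           /* memory consumed by this entry */
} CacheEntry;

typedef struct CacheHeader
{
    CacheEntry *front;          /* most recently accessed entry */
    CacheEntry *back;           /* least recently accessed entry */
    size_t      total_size;     /* total size of all cached entries */
    size_t      size_limit;     /* 0 means no limit */
} CacheHeader;

/* Remove an entry from the LRU list, fixing up front/back pointers. */
static void
lru_unlink(CacheHeader *hdr, CacheEntry *e)
{
    if (e->prev)
        e->prev->next = e->next;
    else
        hdr->front = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        hdr->back = e->prev;
    e->prev = e->next = NULL;
}

/* Put an entry at the front (most recently used end) of the list. */
static void
lru_push_front(CacheHeader *hdr, CacheEntry *e)
{
    e->prev = NULL;
    e->next = hdr->front;
    if (hdr->front)
        hdr->front->prev = e;
    else
        hdr->back = e;
    hdr->front = e;
}

/* Each cache hit moves the found entry to the front of the LRU list. */
void
cache_touch(CacheHeader *hdr, CacheEntry *e)
{
    lru_unlink(hdr, e);
    lru_push_front(hdr, e);
}

/*
 * Each insertion adds the entry size to the total, then evicts from the
 * back of the LRU list until the total is below the limit again.
 */
void
cache_insert(CacheHeader *hdr, CacheEntry *e)
{
    lru_push_front(hdr, e);
    hdr->total_size += e->size;

    while (hdr->size_limit != 0 &&
           hdr->total_size > hdr->size_limit &&
           hdr->back != NULL &&
           hdr->back != e)      /* never evict the entry just added */
    {
        CacheEntry *victim = hdr->back;

        lru_unlink(hdr, victim);
        hdr->total_size -= victim->size;
        free(victim);           /* stand-in for CatCacheRemoveCTup() */
    }
}
```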
> > In that case, the user can just enlarge the catcache.
>
> IMHO the main issues with this are
>
> (a) It's not quite clear how to determine the appropriate limit. I can probably apply a bit of perf+gdb, but I doubt that's very nice.

Like Oracle and MySQL, the user should be able to see the cache hit ratio with a statistics view.

> (b) It's not adaptive, so systems that grow over time (e.g. by adding schemas and other objects) will keep hitting the limit over and over.

The user needs to restart the database instance to enlarge the syscache. That's also true for shared buffers: to accommodate a growing amount of data, the user needs to increase shared_buffers and restart the server. But the current syscache is in local memory, so the server may not need a restart.

> > Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches. If the hit ratio is low, the user can enlarge the catcache size. That's what Oracle and MySQL do, as I referred to in this thread. The tuning parameter is the size. That's all.
>
> How will that work, considering the caches are in private backend memory? And each backend may have quite different characteristics, even if they are connected to the same database?

Assuming that pg_stat_syscache (pid, cache_name, hits, misses) gives the statistics, the statistics data can be stored in shared memory, because the number of backends and the number of catcaches are fixed, as the sketch below illustrates.
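To show why the fixed counts make this straightforward, here is a minimal sketch of such a shared-memory counter area. pg_stat_syscache is the hypothetical view name from above, and NUM_CATCACHES is an assumed compile-time constant standing in for the fixed number of catcaches; this is a sketch of the layout, not actual PostgreSQL code.

```c
/*
 * Sketch of a fixed-size shared-memory statistics area behind the
 * hypothetical pg_stat_syscache view.  Because the number of backends
 * and the number of catcaches are both fixed at server start, one flat
 * array of counters suffices; no dynamic allocation is needed.
 */
#include <stdint.h>

#define NUM_CATCACHES 80        /* assumption: fixed catcache count */

typedef struct SyscacheStats
{
    uint64_t    hits;
    uint64_t    misses;
} SyscacheStats;

/* One counter pair per (backend, catcache), laid out backend-major. */
static inline SyscacheStats *
syscache_stats_slot(SyscacheStats *area, int backend_id, int cache_id)
{
    return &area[backend_id * NUM_CATCACHES + cache_id];
}

/* Hit ratio the view would expose: hits / (hits + misses). */
static inline double
syscache_hit_ratio(const SyscacheStats *s)
{
    uint64_t    total = s->hits + s->misses;

    return (total == 0) ? 0.0 : (double) s->hits / (double) total;
}
```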
> > I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller than SSD/HDD. In our case, we have to pay more attention to limiting the catcache memory consumption, especially because the caches are duplicated in multiple backend processes.
>
> I don't think so. IMHO the focus there is on "cost-effective", i.e. caches are generally more expensive than the storage, so to make them worth it you need to make them much smaller than the main storage.

I think we're saying the same thing. Perhaps my English is not good enough.

> But I don't see how this applies to the problem at hand, because the system is already split into storage + cache (represented by RAM). The challenge is how to use RAM to cache various pieces of data to get the best behavior. The problem is, we don't have a unified cache, but multiple smaller ones (shared buffers, page cache, syscache) competing for the same resource.

You're right. On the other hand, we can consider syscache, shared buffers, and page cache as different tiers of storage, even though they are all in DRAM. syscache caches some data from shared buffers for efficient access. If we use much memory for syscache, there's less memory for caching user data in shared buffers and the page cache. That's a normal tradeoff of caches.

> Slab can do that, but it requires a certain allocation pattern, and I very much doubt syscache has it. It'll be trivial to end up with one active entry on each block (which means slab can't release it).

I expect so, too, although the slab context makes efforts to mitigate that possibility, like this:

 * This also allows various optimizations - for example when searching for
 * free chunk, the allocator reuses space from the fullest blocks first, in
 * the hope that some of the less full blocks will get completely empty (and
 * returned back to the OS).

> BTW doesn't syscache store the full on-disk tuple? That doesn't seem like a fixed-length entry, which is a requirement for slab. No?

Some system catalogs are fixed in size, like pg_am and pg_amop. But I guess the number of such catalogs is small. Dominant catalogs like pg_class and pg_attribute have variable-size rows. So using different memory contexts for the limited set of fixed-size catalogs might not show any visible performance improvement or memory reduction.

Regards
Takayuki Tsunakawa