Thread: [HACKERS] Protect syscache from bloating with negative cache entries
Hello, recently one of my customers stumbled over immoderate catcache bloat. This is a known issue, listed on the Todo page in the PostgreSQL wiki:

https://wiki.postgresql.org/wiki/Todo#Cache_Usage
> Fix memory leak caused by negative catcache entries
https://www.postgresql.org/message-id/51C0A1FF.2050404@vmware.com

This patch addresses two cases of syscache bloat using the invalidation callback mechanism.

Overview of the patch

The bloat is caused by negative cache entries in catcaches. They are crucial for performance, but the problem is that there is no way to remove them: they last for the backend's lifetime. The first patch provides a means to flush negative catcache entries, then defines a relcache invalidation callback to flush negative entries in the syscaches for pg_statistic (STATRELATTINH) and pg_attribute (ATTNAME, ATTNUM). The second patch implements a syscache invalidation callback so that deletion of a schema causes a flush for pg_class (RELNAMENSP). Neither behavior is hard-coded; both are defined in cacheinfo using four additional members.

Remaining problems

The catcache can still bloat from repeated accesses to non-existent tables with unique names in a permanently-living schema, but that seems a bit too artificial (or malicious). Since such negative entries have no trigger that would remove them, caps would be needed to keep them from bloating the syscaches, but reasonable limits seem hardly determinable.

Defects or disadvantages

This patch scans the whole target catcache to find negative entries to remove, which might take a (comparatively) long time on a catcache with many entries. With the second patch, unrelated negative entries may be caught up in a flush, since they are keyed by hash value, not by the exact key values.

The attached files are the following:

1. 0001-Cleanup-negative-cache-of-pg_statistic-when-dropping.patch
   Negative entry flushing on relcache invalidation, using a relcache invalidation callback.
2. 0002-Cleanup-negative-cache-of-pg_class-when-dropping-a-s.patch
   Negative entry flushing on catcache invalidation, using a catcache invalidation callback.
3. gen.pl: a test script for STATRELATTINH bloating.
4. gen2.pl: a test script for RELNAMENSP bloating.

3 and 4 are used as follows:

./gen.pl | psql postgres > /dev/null

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Dec 19, 2016 at 6:15 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello, recently one of my customer stumbled over an immoderate
> catcache bloat.

This isn't only an issue for negative catcache entries. A long time ago, there was a limit on the size of the relcache, which was removed because if you have a workload where the working set of relations is just larger than the limit, performance is terrible. But the problem now is that backend memory usage can grow without bound, and that's also bad, especially on systems with hundreds of long-lived backends. In connection-pooling environments, the problem is worse, because every connection in the pool eventually caches references to everything of interest to any client.

Your patches seem to me to have some merit, but I wonder if we should also consider having a time-based threshold of some kind. If, say, a backend hasn't accessed a catcache or relcache entry for many minutes, it becomes eligible to be flushed. We could implement this by having some process, like the background writer, SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system every 10 minutes or so. When a process receives this signal, it sets a flag that is checked before going idle. When it sees the flag set, it makes a pass over every catcache and relcache entry. All the ones that are unmarked get marked, and all of the ones that are marked get removed. Access to an entry clears any mark. So anything that's not touched for more than 10 minutes starts dropping out of backend caches.

Anyway, that would be a much bigger change from what you are proposing here, and what you are proposing here seems reasonable, so I guess I shouldn't distract from it. Your email just made me think of it, because I agree that catcache/relcache bloat is a serious issue.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
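[The mark-and-sweep aging described above can be sketched roughly as follows. This is a minimal illustration, not actual PostgreSQL code; the struct and function names are invented.]

```c
#include <assert.h>
#include <stdbool.h>

/* Two-pass aging: each housekeeping pass evicts entries that are already
 * marked and marks the rest; any access clears the mark. An entry is thus
 * evicted only after sitting untouched through two consecutive passes. */
typedef struct CacheEntry
{
    bool live;     /* still present in the cache */
    bool marked;   /* untouched since the previous housekeeping pass */
} CacheEntry;

static void
cache_access(CacheEntry *e)
{
    e->marked = false;          /* any access clears the mark */
}

static void
housekeeping_pass(CacheEntry *entries, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (!entries[i].live)
            continue;
        if (entries[i].marked)
            entries[i].live = false;    /* idle for two passes: evict */
        else
            entries[i].marked = true;   /* eviction candidate next pass */
    }
}
```

[With a 10-minute signal interval, the effective idle timeout is therefore between 10 and 20 minutes, depending on where in the cycle the last access fell.]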
On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote:
> We could implement this by having some process, like the background
> writer, SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the
> system every 10 minutes or so.

... on a rolling basis. Otherwise that'll be no fun at all, especially with some of those lovely "we kept getting errors so we raised max_connections to 5000" systems out there. But also on more sensibly configured ones that're busy and want nice smooth performance without stalls.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes: > On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote: >> We could implement this by having >> some process, like the background writer, >> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system >> every 10 minutes or so. > ... on a rolling basis. I don't understand why we'd make that a system-wide behavior at all, rather than expecting each process to manage its own cache. regards, tom lane
On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Craig Ringer <craig@2ndquadrant.com> writes: >> On 20 December 2016 at 21:59, Robert Haas <robertmhaas@gmail.com> wrote: >>> We could implement this by having >>> some process, like the background writer, >>> SendProcSignal(PROCSIG_HOUSEKEEPING) to every process in the system >>> every 10 minutes or so. > >> ... on a rolling basis. > > I don't understand why we'd make that a system-wide behavior at all, > rather than expecting each process to manage its own cache. Individual backends don't have a really great way to do time-based stuff, do they? I mean, yes, there is enable_timeout() and friends, but I think that requires quite a bit of bookkeeping. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't understand why we'd make that a system-wide behavior at all,
>> rather than expecting each process to manage its own cache.

> Individual backends don't have a really great way to do time-based
> stuff, do they? I mean, yes, there is enable_timeout() and friends,
> but I think that requires quite a bit of bookkeeping.

If I thought that "every ten minutes" was an ideal way to manage this, I might worry about that, but it doesn't really sound promising at all. Every so many queries would likely work better, or better yet make it self-adaptive depending on how much is in the local syscache.

The bigger picture here though is that we used to have limits on syscache size, and we got rid of them (commit 8b9bc234a, see also https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us) not only because of the problem you mentioned about performance falling off a cliff once the working-set size exceeded the arbitrary limit, but also because enforcing the limit added significant overhead --- and did so whether or not you got any benefit from it, ie even if the limit is never reached. Maybe the present patch avoids imposing a pile of overhead in situations where no pruning is needed, but it doesn't really look very promising from that angle in a quick once-over.

BTW, I don't see the point of the second patch at all? Surely, if an object is deleted or updated, we already have code that flushes related catcache entries. Otherwise the caches would deliver wrong data.

regards, tom lane
On Tue, Dec 20, 2016 at 3:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Dec 20, 2016 at 10:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't understand why we'd make that a system-wide behavior at all,
>>> rather than expecting each process to manage its own cache.
>
>> Individual backends don't have a really great way to do time-based
>> stuff, do they? I mean, yes, there is enable_timeout() and friends,
>> but I think that requires quite a bit of bookkeeping.
>
> If I thought that "every ten minutes" was an ideal way to manage this,
> I might worry about that, but it doesn't really sound promising at all.
> Every so many queries would likely work better, or better yet make it
> self-adaptive depending on how much is in the local syscache.

I don't think "every so many queries" is very promising at all. First, it has the same problem as a fixed cap on the number of entries: if you're doing a round-robin just slightly bigger than that value, performance will be poor.

Second, what's really important here is to keep the percentage of wall-clock time spent populating the system caches small. If a backend is doing 4000 queries/second and each of those 4000 queries touches a different table, it really needs a cache of at least 4000 entries or it will thrash and slow way down. But if it's doing a query every 10 minutes and those queries round-robin between 4000 different tables, it doesn't really need a 4000-entry cache. If those queries are long-running, the time to repopulate the cache will be only a tiny fraction of runtime. If the queries are short-running, then the effect is, percentage-wise, just the same as for the high-volume system, but in practice it isn't likely to be felt as much. I mean, if we keep a bunch of old cache entries around on a mostly-idle backend, they are going to be pushed out of CPU caches and maybe even paged out. One can't expect a backend that is woken up after a long sleep to be quite as snappy as one that's continuously active.

Which gets to my third point: anything that's based on number of queries won't do anything to help the case where backends sometimes go idle and sit there for long periods. Reducing resource utilization in that case would be beneficial. Ideally I'd like to get rid of not only the backend-local cache contents but the backend itself, but that's a much harder project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Thank you for the discussion.

At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
> The bigger picture here though is that we used to have limits on syscache
> size, and we got rid of them (commit 8b9bc234a, see also
> https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> not only because of the problem you mentioned about performance falling
> off a cliff once the working-set size exceeded the arbitrary limit, but
> also because enforcing the limit added significant overhead --- and did so
> whether or not you got any benefit from it, ie even if the limit is never
> reached. Maybe the present patch avoids imposing a pile of overhead in
> situations where no pruning is needed, but it doesn't really look very
> promising from that angle in a quick once-over.

Indeed. As mentioned in the mail at the beginning of this thread, it performs the whole-cache scan if at least one negative entry exists, even one unrelated to the target relid, and that can take significantly long on a fat cache.

Lists of negative entries, like CatCacheList, would help but would need additional memory.

> BTW, I don't see the point of the second patch at all? Surely, if
> an object is deleted or updated, we already have code that flushes
> related catcache entries. Otherwise the caches would deliver wrong
> data.

Maybe you have taken the patch wrongly. Negative entries won't be flushed by any means. Deletion of a namespace causes cascaded object deletion according to dependencies, which finally leads to invalidation of non-negative cache entries. But removal of *negative entries* in RELNAMENSP won't happen.

The test script for the case (gen2.pl) does the following:

CREATE SCHEMA foo;
SELECT * FROM foo.invalid;
DROP SCHEMA foo;

Removing the schema foo leaves a negative cache entry for 'foo.invalid' in RELNAMENSP. However, I'm not sure the above situation happens so frequently that it is worth amending.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
At Wed, 21 Dec 2016 10:21:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161221.102109.51106943.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 20 Dec 2016 15:10:21 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <23492.1482264621@sss.pgh.pa.us>
> > The bigger picture here though is that we used to have limits on syscache
> > size, and we got rid of them (commit 8b9bc234a, see also
> > https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> > not only because of the problem you mentioned about performance falling
> > off a cliff once the working-set size exceeded the arbitrary limit, but
> > also because enforcing the limit added significant overhead --- and did so
> > whether or not you got any benefit from it, ie even if the limit is never
> > reached. Maybe the present patch avoids imposing a pile of overhead in
> > situations where no pruning is needed, but it doesn't really look very
> > promising from that angle in a quick once-over.
>
> Indeed. As mentioned in the mail at the beginning of this thread,
> it performs the whole-cache scan if at least one negative entry
> exists, even one unrelated to the target relid, and that can take
> significantly long on a fat cache.
>
> Lists of negative entries, like CatCacheList, would help but would
> need additional memory.
>
> > BTW, I don't see the point of the second patch at all? Surely, if
> > an object is deleted or updated, we already have code that flushes
> > related catcache entries. Otherwise the caches would deliver wrong
> > data.
>
> Maybe you have taken the patch wrongly. Negative entries won't be
> flushed by any means. Deletion of a namespace causes cascaded
> object deletion according to dependencies, which finally leads to
> invalidation of non-negative cache entries. But removal of *negative
> entries* in RELNAMENSP won't happen.
>
> The test script for the case (gen2.pl) does the following:
>
> CREATE SCHEMA foo;
> SELECT * FROM foo.invalid;
> DROP SCHEMA foo;
>
> Removing the schema foo leaves a negative cache entry for
> 'foo.invalid' in RELNAMENSP.
>
> However, I'm not sure the above situation happens so frequently
> that it is worth amending.

Since 1753b1b conflicts with this patch, I have rebased it onto the current master HEAD. I'll register this for the next CF.

The points of discussion are the following, I think.

1. The first patch seems to work well. It costs the time to scan the whole of a catcache that has negative entries for other reloids. However, such negative entries are created only by rather unusual usage: accessing undefined columns, or accessing columns on which no statistics have been created. The whole-catcache scan occurs on ATTNAME, ATTNUM and STATRELATTINH for every invalidation of a relcache entry.

2. The second patch also works, but flushing negative entries by hash value is inefficient. It scans the bucket corresponding to the given hash value for OIDs, then flushes negative entries while iterating over all the collected OIDs. So this costs more time than 1, and the flush involves entries that need not be removed. If this feature is valuable but such side effects are not acceptable, a new invalidation category based on a cacheid-oid pair would be needed.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
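[The over-flushing described in point 2 can be shown with a toy model. This is purely illustrative; none of these names exist in the PostgreSQL sources. Because the invalidation carries only a hash value, any negative entry whose key hashes to the same value is flushed along with the intended one.]

```c
#include <assert.h>
#include <stdbool.h>

/* A negative entry keyed by name, stored under its hash value. */
typedef struct NegEntry
{
    const char *key;
    unsigned    hash;
    bool        live;
} NegEntry;

/* Flush every negative entry whose hash matches the invalidation's
 * hash value. The key itself is never compared, so entries for
 * unrelated names that happen to share the hash are flushed too. */
static void
flush_negatives_by_hash(NegEntry *entries, int n, unsigned hashvalue)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i].live && entries[i].hash == hashvalue)
            entries[i].live = false;    /* hash match alone triggers it */
    }
}
```

[A cacheid-oid based invalidation, as suggested above, would avoid this by identifying the exact object instead of only its hash.]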
On Wed, Dec 21, 2016 at 5:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If I thought that "every ten minutes" was an ideal way to manage this,
> I might worry about that, but it doesn't really sound promising at all.
> Every so many queries would likely work better, or better yet make it
> self-adaptive depending on how much is in the local syscache.
>
> The bigger picture here though is that we used to have limits on syscache
> size, and we got rid of them (commit 8b9bc234a, see also
> https://www.postgresql.org/message-id/flat/5141.1150327541%40sss.pgh.pa.us)
> not only because of the problem you mentioned about performance falling
> off a cliff once the working-set size exceeded the arbitrary limit, but
> also because enforcing the limit added significant overhead --- and did so
> whether or not you got any benefit from it, ie even if the limit is never
> reached. Maybe the present patch avoids imposing a pile of overhead in
> situations where no pruning is needed, but it doesn't really look very
> promising from that angle in a quick once-over.

Have there ever been discussions about keeping catcache entries in a shared memory area? This may not sound great performance-wise; I am just wondering about the concept, as I cannot find references to such discussions.

--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> Have there been ever discussions about having catcache entries in a
> shared memory area? This does not sound much performance-wise, I am
> just wondering about the concept and I cannot find references to such
> discussions.

I'm sure it's been discussed. Offhand I remember the following issues:

* A shared cache would create locking and contention overhead.

* A shared cache would have a very hard size limit, at least if it's in SysV-style shared memory (perhaps DSM would let us relax that).

* Transactions that are doing DDL have a requirement for the catcache to reflect changes that they've made locally but not yet committed, so said changes mustn't be visible globally.

You could possibly get around the third point with a local catcache that's searched before the shared one, but tuning that to be performant sounds like a mess. Also, I'm not sure how such a structure could cope with uncommitted deletions: delete A -> remove A from local catcache, but not the shared one -> search for A in local catcache -> not found -> search for A in shared catcache -> found -> oops.

regards, tom lane
On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Have there been ever discussions about having catcache entries in a
>> shared memory area? This does not sound much performance-wise, I am
>> just wondering about the concept and I cannot find references to such
>> discussions.
>
> I'm sure it's been discussed. Offhand I remember the following issues:
>
> * A shared cache would create locking and contention overhead.
>
> * A shared cache would have a very hard size limit, at least if it's
> in SysV-style shared memory (perhaps DSM would let us relax that).
>
> * Transactions that are doing DDL have a requirement for the catcache
> to reflect changes that they've made locally but not yet committed,
> so said changes mustn't be visible globally.
>
> You could possibly get around the third point with a local catcache that's
> searched before the shared one, but tuning that to be performant sounds
> like a mess. Also, I'm not sure how such a structure could cope with
> uncommitted deletions: delete A -> remove A from local catcache, but not
> the shared one -> search for A in local catcache -> not found -> search
> for A in shared catcache -> found -> oops.

I think the first of those concerns is the key one. If searching the system catalogs costs $100 and searching the private catcache costs $1, what's the cost of searching a hypothetical shared catcache? If the answer is $80, it's not worth doing. If the answer is $5, it's probably still not worth doing. If the answer is $1.25, then it's probably worth investing some energy into trying to solve the other problems you list. For some users, the memory cost of catcache and syscache entries multiplied by N backends is a very serious problem, so it would be nice to have some other options. But we do so many syscache lookups that a shared cache won't be viable unless it's almost as fast as a backend-private cache, or at least that's my hunch.

I think it would be interesting for somebody to build a prototype here that ignores all the problems but the first and uses some straightforward, relatively unoptimized locking strategy for the first problem. Then benchmark it. If the results show that the idea has legs, then we can try to figure out what a real implementation would look like. (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 14, 2017 at 12:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 13, 2017 at 8:58 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Michael Paquier <michael.paquier@gmail.com> writes: >>> Have there been ever discussions about having catcache entries in a >>> shared memory area? This does not sound much performance-wise, I am >>> just wondering about the concept and I cannot find references to such >>> discussions. >> >> I'm sure it's been discussed. Offhand I remember the following issues: >> >> * A shared cache would create locking and contention overhead. >> >> * A shared cache would have a very hard size limit, at least if it's >> in SysV-style shared memory (perhaps DSM would let us relax that). >> >> * Transactions that are doing DDL have a requirement for the catcache >> to reflect changes that they've made locally but not yet committed, >> so said changes mustn't be visible globally. >> >> You could possibly get around the third point with a local catcache that's >> searched before the shared one, but tuning that to be performant sounds >> like a mess. Also, I'm not sure how such a structure could cope with >> uncommitted deletions: delete A -> remove A from local catcache, but not >> the shared one -> search for A in local catcache -> not found -> search >> for A in shared catcache -> found -> oops. > > I think the first of those concerns is the key one. If searching the > system catalogs costs $100 and searching the private catcache costs > $1, what's the cost of searching a hypothetical shared catcache? If > the answer is $80, it's not worth doing. If the answer is $5, it's > probably still not worth doing. If the answer is $1.25, then it's > probably worth investing some energy into trying to solve the other > problems you list. For some users, the memory cost of catcache and > syscache entries multiplied by N backends are a very serious problem, > so it would be nice to have some other options. 
> But we do so many syscache lookups that a shared cache won't be viable
> unless it's almost as fast as a backend-private cache, or at least
> that's my hunch.

Being able to switch from one mode to another would be interesting. Applications doing extensive DDL, which requires changing the catcache under an exclusive lock, would clearly pay the lock contention cost, but do you think that would really be the case with a shared lock? A bunch of applications that I work with deploy Postgres once, then don't change the schema except when an upgrade happens, so a shared cache would be beneficial for them. There are even some apps that do not use pgbouncer but drop sessions after a timeout of inactivity to avoid memory bloat, because of the problem discussed in this thread. A shared cache won't solve the problem of local catcache bloat, but some users issuing few DDLs may be fine paying some extra concurrency cost if session handling gets easier.

> I think it would be interesting for somebody to build a prototype here
> that ignores all the problems but the first and uses some
> straightforward, relatively unoptimized locking strategy for the first
> problem. Then benchmark it. If the results show that the idea has
> legs, then we can try to figure out what a real implementation would
> look like.
> (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

Yeah, I'd bet on a couple of days of focus to sort that out.

--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> ... There are even some apps that do not use pgbouncer, but drop
> sessions after a timeout of inactivity to avoid a memory bloat because
> of the problem of this thread.

Yeah, a certain company I used to work for had to do that, though their problem had more to do with bloat in plpgsql's compiled-functions cache (and ensuing bloat in the plancache), I believe.

Still, I'm pretty suspicious of anything that will add overhead to catcache lookups. If you think the performance of those is not absolutely critical, turning off the caches via -DCLOBBER_CACHE_ALWAYS will soon disabuse you of the error.

I'm inclined to think that a more profitable direction to look in is finding a way to limit the cache size. I know we got rid of exactly that years ago, but the problems with it were (a) the mechanism was itself pretty expensive --- a global-to-all-caches LRU list IIRC, and (b) there wasn't a way to tune the limit. Possibly somebody can think of some cheaper, perhaps less precise way of aging out old entries. As for (b), this is the sort of problem we made GUCs for.

But, again, the catcache isn't the only source of per-process bloat and I'm not even sure it's the main one. A more holistic approach might be called for.

regards, tom lane
Hi, On 2017-01-13 17:58:41 -0500, Tom Lane wrote: > But, again, the catcache isn't the only source of per-process bloat > and I'm not even sure it's the main one. A more holistic approach > might be called for. It'd be helpful if we'd find a way to make it easy to get statistics about the size of various caches in production systems. Right now that's kinda hard, resulting in us having to make a lot of guesses... Andres
On 01/14/2017 12:06 AM, Andres Freund wrote: > Hi, > > > On 2017-01-13 17:58:41 -0500, Tom Lane wrote: >> But, again, the catcache isn't the only source of per-process bloat >> and I'm not even sure it's the main one. A more holistic approach >> might be called for. > > It'd be helpful if we'd find a way to make it easy to get statistics > about the size of various caches in production systems. Right now > that's kinda hard, resulting in us having to make a lot of > guesses... > What about a simple C extension, that could inspect those caches? Assuming it could be loaded into a single backend, that should be relatively acceptable way (compared to loading it to all backends using shared_preload_libraries). -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 14, 2017 at 9:36 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 01/14/2017 12:06 AM, Andres Freund wrote:
>> On 2017-01-13 17:58:41 -0500, Tom Lane wrote:
>>> But, again, the catcache isn't the only source of per-process bloat
>>> and I'm not even sure it's the main one. A more holistic approach
>>> might be called for.
>>
>> It'd be helpful if we'd find a way to make it easy to get statistics
>> about the size of various caches in production systems. Right now
>> that's kinda hard, resulting in us having to make a lot of
>> guesses...
>
> What about a simple C extension, that could inspect those caches? Assuming
> it could be loaded into a single backend, that should be relatively
> acceptable way (compared to loading it to all backends using
> shared_preload_libraries).

Such an extension could do a small amount of work on a portion of the syscache entries at each query loop; still, I am wondering whether it would not be nicer to get that in core and configurable, which is basically the approach proposed by Horiguchi-san. At least it seems to me that it has some merit, and if we could make that behavior switchable, disabled by default, that would be a win for some class of applications. What do others think?

--
Michael
On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
> The points of discussion are the following, I think.
>
> 1. The first patch seems working well. It costs the time to scan
> the whole of a catcache that have negative entries for other
> reloids. However, such negative entries are created by rather
> unusual usages. Accesing to undefined columns, and accessing
> columns on which no statistics have created. The
> whole-catcache scan occurs on ATTNAME, ATTNUM and
> STATRELATTINH for every invalidation of a relcache entry.

I took a look at this. It looks sane, though I've got a few minor comment tweaks:

+ * Remove negative cache tuples maching a partial key.

s/maching/matching/

+/* searching with a paritial key needs scanning the whole cache */

s/needs/means/

+ * a negative cache entry cannot be referenced so we can remove

s/referenced/referenced,/

I was wondering if there's a way to test the performance impact of deleting negative entries.

> 2. The second patch also works, but flushing negative entries by
> hash values is inefficient. It scans the bucket corresponding
> to given hash value for OIDs, then flushing negative entries
> iterating over all the collected OIDs. So this costs more time
> than 1 and flushes involving entries that is not necessary to
> be removed. If this feature is valuable but such side effects
> are not acceptable, new invalidation category based on
> cacheid-oid pair would be needed.

I glanced at this and it looks sane. Didn't go any farther since this one's pretty up in the air. ISTM it'd be better to do some kind of aging instead of patch 2.

The other (possibly naive) question I have is how useful negative entries really are. Will Postgres regularly incur negative lookups, or will these only happen due to user activity? I can't think of a case where an app would need to depend on fast negative lookups (in other words, that should be considered a bug in the app).
I can see where getting rid of them completely might be problematic, but maybe we can just keep a relatively small number of them around. I'm thinking a simple LRU list of X negative entries; when that fills, you reuse the oldest one. You'd have to pay the LRU maintenance cost on every negative hit, but if those aren't common it shouldn't be bad. That might well necessitate another GUC, but it seems a lot simpler than most of the other ideas.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
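[The fixed-size LRU suggested above could look roughly like this. A minimal sketch only, assuming a logical clock instead of a linked list; the names and the two-slot capacity are invented for illustration.]

```c
#include <assert.h>
#include <string.h>

/* A tiny fixed-size pool of negative entries: a hit refreshes the slot's
 * recency stamp, and a miss inserts into the least recently used slot. */
#define NEG_CACHE_SIZE 2

typedef struct NegSlot
{
    char     key[64];
    unsigned stamp;             /* last-use tick; 0 = slot is empty */
} NegSlot;

static NegSlot  slots[NEG_CACHE_SIZE];
static unsigned clock_tick;

/* Return 1 on a hit (refreshing recency); on a miss, insert the key,
 * reusing the stalest slot, and return 0. */
static int
neg_cache_lookup(const char *key)
{
    int oldest = 0;

    clock_tick++;
    for (int i = 0; i < NEG_CACHE_SIZE; i++)
    {
        if (slots[i].stamp != 0 && strcmp(slots[i].key, key) == 0)
        {
            slots[i].stamp = clock_tick;    /* LRU maintenance on each hit */
            return 1;
        }
        if (slots[i].stamp < slots[oldest].stamp)
            oldest = i;
    }
    /* miss: reuse the least recently used slot */
    strncpy(slots[oldest].key, key, sizeof(slots[oldest].key) - 1);
    slots[oldest].key[sizeof(slots[oldest].key) - 1] = '\0';
    slots[oldest].stamp = clock_tick;
    return 0;
}
```

[The per-hit cost here is a scan of the pool, which is fine while the pool is small; a real implementation would presumably hang a doubly-linked LRU list off the existing hash buckets instead.]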
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> The other (possibly naive) question I have is how useful negative
> entries really are? Will Postgres regularly incur negative lookups, or
> will these only happen due to user activity?

It varies depending on the particular syscache, but in at least some of them, negative cache entries are critical for performance. See for example RelnameGetRelid(), which basically does a RELNAMENSP cache lookup for each schema down the search path until it finds a match. For any user table name with the standard search_path, there's a guaranteed failure in pg_catalog before you can hope to find a match. If we don't have negative cache entries, then *every invocation of this function has to go to disk* (or at least to shared buffers).

It's possible that we could revise all our lookup patterns to avoid this sort of thing. But I don't have much faith in that always being possible, and exactly none that we won't introduce new lookup patterns that need it in future. I spent some time, for instance, wondering if RelnameGetRelid could use a SearchSysCacheList lookup instead, doing the lookup on table name only and then inspecting the whole list to see which entry is frontmost according to the current search path. But that has performance failure modes of its own, for example if you have identical table names in a boatload of different schemas. We do it that way for some other cases such as function lookups, but I think it's much less likely that people have identical function names in N schemas than that they have identical table names in N schemas.

If you want to poke into this for particular test scenarios, building with CATCACHE_STATS defined will yield a bunch of numbers dumped to the postmaster log at each backend exit.

regards, tom lane
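[The RelnameGetRelid() pattern described above can be modeled in a few lines. This is a toy, not the real code: a two-schema search path where the looked-up table lives in public, so every lookup first misses in pg_catalog. It counts how often the "catalog" must be consulted with and without negative caching.]

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static int catalog_probes;      /* lookups that had to go past the cache */

/* the "catalog": only public.mytab exists in this model */
static bool
catalog_contains(const char *schema, const char *name)
{
    catalog_probes++;
    return strcmp(schema, "public") == 0 && strcmp(name, "mytab") == 0;
}

/* per-schema cache state for the single name we model */
typedef enum { UNCACHED, CACHED_HIT, CACHED_MISS } CacheState;

static bool
lookup(const char *name, CacheState cache[2], bool use_negative_entries)
{
    static const char *search_path[2] = {"pg_catalog", "public"};

    for (int i = 0; i < 2; i++)
    {
        if (cache[i] == CACHED_HIT)
            return true;
        if (cache[i] == CACHED_MISS)
            continue;           /* negative entry: skip the catalog probe */
        if (catalog_contains(search_path[i], name))
        {
            cache[i] = CACHED_HIT;
            return true;
        }
        if (use_negative_entries)
            cache[i] = CACHED_MISS;
    }
    return false;
}
```

[With negative entries the guaranteed pg_catalog miss is paid once; without them it is paid again on every call, which is the "every invocation has to go to disk" effect described above.]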
On 1/21/17 8:54 PM, Tom Lane wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> The other (possibly naive) question I have is how useful negative
>> entries really are? Will Postgres regularly incur negative lookups, or
>> will these only happen due to user activity?
> It varies depending on the particular syscache, but in at least some
> of them, negative cache entries are critical for performance.
> See for example RelnameGetRelid(), which basically does a RELNAMENSP
> cache lookup for each schema down the search path until it finds a
> match.

Ahh, I hadn't considered that. So one idea would be to only track negative entries on caches where we know they're actually useful. That might make the performance hit of some of the other ideas more tolerable. Presumably you're much less likely to pollute the namespace cache than some of the others.

-- 
Jim Nasby
On 1/22/17 4:41 PM, Jim Nasby wrote:
> On 1/21/17 8:54 PM, Tom Lane wrote:
>> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>>> The other (possibly naive) question I have is how useful negative
>>> entries really are? Will Postgres regularly incur negative lookups, or
>>> will these only happen due to user activity?
>> It varies depending on the particular syscache, but in at least some
>> of them, negative cache entries are critical for performance.
>> See for example RelnameGetRelid(), which basically does a RELNAMENSP
>> cache lookup for each schema down the search path until it finds a
>> match.
>
> Ahh, I hadn't considered that. So one idea would be to only track
> negative entries on caches where we know they're actually useful. That
> might make the performance hit of some of the other ideas more
> tolerable. Presumably you're much less likely to pollute the namespace
> cache than some of the others.

Ok, after reading the code I see I only partly understood what you were saying. In any case, it might still be useful to do some testing with CATCACHE_STATS defined to see if there are caches that don't accumulate a lot of negative entries.

Attached is a patch that tries to document some of this.

-- 
Jim Nasby
Jim Nasby <Jim.Nasby@bluetreble.com> writes: >> Ahh, I hadn't considered that. So one idea would be to only track >> negative entries on caches where we know they're actually useful. That >> might make the performance hit of some of the other ideas more >> tolerable. Presumably you're much less likely to pollute the namespace >> cache than some of the others. > Ok, after reading the code I see I only partly understood what you were > saying. In any case, it might still be useful to do some testing with > CATCACHE_STATS defined to see if there's caches that don't accumulate a > lot of negative entries. There definitely are, according to my testing, but by the same token it's not clear that a shutoff check would save anything. regards, tom lane
On 1/22/17 5:03 PM, Tom Lane wrote:
>> Ok, after reading the code I see I only partly understood what you were
>> saying. In any case, it might still be useful to do some testing with
>> CATCACHE_STATS defined to see if there's caches that don't accumulate a
>> lot of negative entries.
> There definitely are, according to my testing, but by the same token
> it's not clear that a shutoff check would save anything.

Currently they wouldn't, but there are concerns about the performance of some of the other ideas in this thread. Getting rid of negative entries that don't really help could reduce some of those concerns. Or perhaps the original complaint about STATRELATTINH could be solved by just disabling negative entries on that cache.

-- 
Jim Nasby
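The per-cache shutoff floated here would be tiny in code terms: one flag consulted at the point where a negative entry would otherwise be created. A hypothetical sketch (the field name and helper are invented; the real CatCache/cachedesc structs differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the real structs. */
typedef struct CatCache
{
	const char *cc_relname;
	bool		cc_no_negative; /* proposed per-cache shutoff flag */
	int			cc_ntup;		/* entry count, negative ones included */
} CatCache;

/*
 * Called where a failed lookup would normally leave a negative entry.
 * Returns true if one was actually stored.
 */
static bool
remember_negative(CatCache *cache)
{
	if (cache->cc_no_negative)
		return false;			/* this cache opts out: nothing to bloat */
	cache->cc_ntup++;			/* otherwise keep the "not found" result */
	return true;
}
```

The trade-off Tom notes still applies: the flag saves memory only on caches whose negative entries never pay for themselves, and every miss on an opted-out cache goes back to the catalog.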
On 1/21/17 6:42 PM, Jim Nasby wrote:
> On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
>> The points of discussion are the following, I think.
>>
>> 1. The first patch seems working well. It costs the time to scan
>>    the whole of a catcache that have negative entries for other
>>    reloids. However, such negative entries are created by rather
>>    unusual usages. Accesing to undefined columns, and accessing
>>    columns on which no statistics have created. The
>>    whole-catcache scan occurs on ATTNAME, ATTNUM and
>>    STATRELATTINH for every invalidation of a relcache entry.
>
> I took a look at this. It looks sane, though I've got a few minor
> comment tweaks:
>
> + * Remove negative cache tuples maching a partial key.
> s/maching/matching/
>
> +/* searching with a paritial key needs scanning the whole cache */
> s/needs/means/
>
> + * a negative cache entry cannot be referenced so we can remove
> s/referenced/referenced,/
>
> I was wondering if there's a way to test the performance impact of
> deleting negative entries.

I did a make installcheck run with CATCACHE_STATS to see how often we get negative entries in the 3 caches affected by this patch. The caches on pg_attribute get almost no negative entries. pg_statistic gets a good amount of negative entries, presumably because we start off with no entries in there. On a stable system that presumably won't be an issue, but if temporary tables are in use and being analyzed I'd think there could be a moderate amount of inval traffic on that cache. I'll leave it to a committer to decide if they think that's an issue, but you might want to try and quantify how big a hit that is. I think it'd also be useful to know how much bloat you were seeing in the field.

The patch is currently conflicting against master though, due to some caches being added. Can you rebase?
BTW, if you set a slightly larger context size on the patch you might be able to avoid rebases; right now the patch doesn't include enough context to uniquely identify the chunks against cacheinfo[].

-- 
Jim Nasby
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Hello, thank you for looking at this.

At Mon, 23 Jan 2017 16:54:36 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <21803f50-a823-c444-ee2b-9a153114f454@BlueTreble.com>
> On 1/21/17 6:42 PM, Jim Nasby wrote:
> > On 12/26/16 2:31 AM, Kyotaro HORIGUCHI wrote:
> >> The points of discussion are the following, I think.
> >>
> >> 1. The first patch seems working well. It costs the time to scan
> >>    the whole of a catcache that have negative entries for other
> >>    reloids. However, such negative entries are created by rather
> >>    unusual usages. Accesing to undefined columns, and accessing
> >>    columns on which no statistics have created. The
> >>    whole-catcache scan occurs on ATTNAME, ATTNUM and
> >>    STATRELATTINH for every invalidation of a relcache entry.
> >
> > I took a look at this. It looks sane, though I've got a few minor
> > comment tweaks:
> >
> > + * Remove negative cache tuples maching a partial key.
> > s/maching/matching/
> >
> > +/* searching with a paritial key needs scanning the whole cache */
> > s/needs/means/
> >
> > + * a negative cache entry cannot be referenced so we can remove
> > s/referenced/referenced,/
> >
> > I was wondering if there's a way to test the performance impact of
> > deleting negative entries.

Thanks for pointing those out. They are addressed.

> I did a make installcheck run with CATCACHE_STATS to see how often we
> get negative entries in the 3 caches affected by this patch. The
> caches on pg_attribute get almost no negative entries. pg_statistic
> gets a good amount of negative entries, presumably because we start
> off with no entries in there. On a stable system that presumably won't
> be an issue, but if temporary tables are in use and being analyzed I'd
> think there could be a moderate amount of inval traffic on that
> cache. I'll leave it to a committer to decide if they think that's an
> issue, but you might want to try and quantify how big a hit that is.
> I think it'd also be useful to know how much bloat you were seeing in
> the field.
>
> The patch is currently conflicting against master though, due to some
> caches being added. Can you rebase?

Six new syscaches added in 665d1fa conflicted, and a 3-way merge resolved them correctly. The new syscaches don't seem to be targets of this patch.

> BTW, if you set a slightly larger
> context size on the patch you might be able to avoid rebases; right
> now the patch doesn't include enough context to uniquely identify the
> chunks against cacheinfo[].

git format-patch -U5 fuses all the hunks on cacheinfo[] together, and I'm not sure such a hunk can avoid rebases. Is this what you suggested? -U4 adds an identifiable forward context line for some elements, so the attached patch is made with four context lines.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Hello,

I have tried capping the number of negative entries myself (by removing negative entries in least-recently-created-first order), but the ceilings cannot be reasonably determined, either absolutely or relative to the number of positive entries. Apparently it differs widely among caches and applications.

At Mon, 23 Jan 2017 08:16:49 -0600, Jim Nasby <Jim.Nasby@BlueTreble.com> wrote in <6519b7ad-0aa6-c9f4-8869-20691107fb69@BlueTreble.com>
> On 1/22/17 5:03 PM, Tom Lane wrote:
> >> Ok, after reading the code I see I only partly understood what you
> >> were saying. In any case, it might still be useful to do some testing
> >> with CATCACHE_STATS defined to see if there's caches that don't
> >> accumulate a lot of negative entries.
> > There definitely are, according to my testing, but by the same token
> > it's not clear that a shutoff check would save anything.
>
> Currently they wouldn't, but there's concerns about the performance of
> some of the other ideas in this thread. Getting rid of negative
> entries that don't really help could reduce some of those concerns. Or
> perhaps the original complaint about STATRELATTINH could be solved by
> just disabling negative entries on that cache.

As for STATRELATTINH, planning that involves small, frequently accessed temporary tables will benefit from negative entries, but the benefit might be ignorably small. ATTNAME, ATTNUM and RELNAMENSP also might not get much from negative entries. If these are true, the whole mechanism this patch adds could be replaced with just a boolean in cachedesc that inhibits negative entries. Anyway, this patch doesn't cover the case of cache bloat related to function references. I'm not sure how that could be reproduced, though.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
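The capping experiment described above (deleting negative entries least recently created first) is essentially a FIFO ring buffer, which also illustrates why the ceiling is hard to choose: any workload that cycles through more missing names than the cap gets no benefit at all. A sketch with invented names, not actual catcache code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NEG_CAP 4				/* the ceiling that is hard to determine */

typedef struct NegFifo
{
	uint32_t	keys[NEG_CAP];	/* hashes of known-missing lookups */
	int			next;			/* slot to overwrite: the oldest-created */
	int			nused;
} NegFifo;

static void
negfifo_insert(NegFifo *f, uint32_t hash)
{
	f->keys[f->next] = hash;	/* reuse the oldest slot when full */
	f->next = (f->next + 1) % NEG_CAP;
	if (f->nused < NEG_CAP)
		f->nused++;
}

/* Unlike an LRU, a hit costs nothing extra: no reordering is done. */
static bool
negfifo_lookup(const NegFifo *f, uint32_t hash)
{
	for (int i = 0; i < f->nused; i++)
		if (f->keys[i] == hash)
			return true;
	return false;
}
```

The first-created-first-deleted policy is cheaper per hit than LRU maintenance, but it evicts hot negative entries just as readily as cold ones, which is one more reason the right cap is workload-dependent.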
On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Six new syscaches in 665d1fa was conflicted and 3-way merge > worked correctly. The new syscaches don't seem to be targets of > this patch. To be honest, I am not completely sure what to think about this patch. Moved to next CF as there is a new version, and no new reviews to make the discussion perhaps move on. -- Michael
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Hello, thank you for moving this to the next CF.

At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com>
> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Six new syscaches in 665d1fa was conflicted and 3-way merge
> > worked correctly. The new syscaches don't seem to be targets of
> > this patch.
>
> To be honest, I am not completely sure what to think about this patch.
> Moved to next CF as there is a new version, and no new reviews to make
> the discussion perhaps move on.

I'm thinking the following is the status of this topic.

- The patch still applies without conflicts.

- This is not a holistic measure against the memory leak, but it surely saves some existing cases.

- Shared catcache is another discussion (and won't really be proposed in a short time due to the issue of locking).

- As I mentioned, a patch that caps the number of negative entries is available (in first-created, first-deleted manner), but it has a loose end of how to determine the limit.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote: > Hello, thank you for moving this to the next CF. > > At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com> >> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>> Six new syscaches in 665d1fa was conflicted and 3-way merge >>> worked correctly. The new syscaches don't seem to be targets of >>> this patch. >> >> To be honest, I am not completely sure what to think about this patch. >> Moved to next CF as there is a new version, and no new reviews to make >> the discussion perhaps move on. > > I'm thinking the following is the status of this topic. > > - The patch stll is not getting conflicted. > > - This is not a hollistic measure for memory leak but surely > saves some existing cases. > > - Shared catcache is another discussion (and won't really > proposed in a short time due to the issue on locking.) > > - As I mentioned, a patch that caps the number of negative > entries is avaiable (in first-created - first-delete manner) > but it is having a loose end of how to determine the > limitation. While preventing bloat in the syscache is a worthwhile goal, it appears there are a number of loose ends here and a new patch has not been provided. It's a pretty major change so I recommend moving this patch to the 2017-07 CF. -- -David david@pgmasters.net
On 3/3/17 4:54 PM, David Steele wrote: > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote: >> Hello, thank you for moving this to the next CF. >> >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com> >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge >>>> worked correctly. The new syscaches don't seem to be targets of >>>> this patch. >>> To be honest, I am not completely sure what to think about this patch. >>> Moved to next CF as there is a new version, and no new reviews to make >>> the discussion perhaps move on. >> I'm thinking the following is the status of this topic. >> >> - The patch stll is not getting conflicted. >> >> - This is not a hollistic measure for memory leak but surely >> saves some existing cases. >> >> - Shared catcache is another discussion (and won't really >> proposed in a short time due to the issue on locking.) >> >> - As I mentioned, a patch that caps the number of negative >> entries is avaiable (in first-created - first-delete manner) >> but it is having a loose end of how to determine the >> limitation. > While preventing bloat in the syscache is a worthwhile goal, it appears > there are a number of loose ends here and a new patch has not been provided. > > It's a pretty major change so I recommend moving this patch to the > 2017-07 CF. Not hearing any opinions pro or con, I'm moving this patch to the 2017-07 CF. -- -David david@pgmasters.net
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in <3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net> > On 3/3/17 4:54 PM, David Steele wrote: > > > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote: > >> Hello, thank you for moving this to the next CF. > >> > >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier > >> <michael.paquier@gmail.com> wrote in > >> <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com> > >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI > >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge > >>>> worked correctly. The new syscaches don't seem to be targets of > >>>> this patch. > >>> To be honest, I am not completely sure what to think about this patch. > >>> Moved to next CF as there is a new version, and no new reviews to make > >>> the discussion perhaps move on. > >> I'm thinking the following is the status of this topic. > >> > >> - The patch stll is not getting conflicted. > >> > >> - This is not a hollistic measure for memory leak but surely > >> saves some existing cases. > >> > >> - Shared catcache is another discussion (and won't really > >> proposed in a short time due to the issue on locking.) > >> > >> - As I mentioned, a patch that caps the number of negative > >> entries is avaiable (in first-created - first-delete manner) > >> but it is having a loose end of how to determine the > >> limitation. > > While preventing bloat in the syscache is a worthwhile goal, it > > appears > > there are a number of loose ends here and a new patch has not been > > provided. > > > > It's a pretty major change so I recommend moving this patch to the > > 2017-07 CF. > > Not hearing any opinions pro or con, I'm moving this patch to the > 2017-07 CF. Ah. Yes, I agree on this. Thanks. -- Kyotaro Horiguchi NTT Open Source Software Center
On 1/24/17 02:58, Kyotaro HORIGUCHI wrote: >> BTW, if you set a slightly larger >> context size on the patch you might be able to avoid rebases; right >> now the patch doesn't include enough context to uniquely identify the >> chunks against cacheinfo[]. > git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm > not sure that such a hunk can avoid rebases. Is this what you > suggested? -U4 added an identifiable forward context line for > some elements so the attached patch is made with four context > lines. This patch needs another rebase for the upcoming commit fest. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Thank you for your attention.

At Mon, 14 Aug 2017 17:33:48 -0400, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote in <09fa011f-4536-b05d-0625-11f3625d8332@2ndquadrant.com>
> On 1/24/17 02:58, Kyotaro HORIGUCHI wrote:
> >> BTW, if you set a slightly larger
> >> context size on the patch you might be able to avoid rebases; right
> >> now the patch doesn't include enough context to uniquely identify the
> >> chunks against cacheinfo[].
> > git format-patch -U5 fuses all hunks on cacheinfo[] together. I'm
> > not sure that such a hunk can avoid rebases. Is this what you
> > suggested? -U4 added an identifiable forward context line for
> > some elements so the attached patch is made with four context
> > lines.
>
> This patch needs another rebase for the upcoming commit fest.

This patch has had interference from several commits since the last submission. I amended it to follow them (up to f97c55c), removed an unnecessary branch, and edited some comments.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> This patch have had interferences from several commits after the
> last submission. I amended this patch to follow them (up to
> f97c55c), removed an unnecessary branch and edited some comments.

I think the core problem for this patch is that there's no consensus on what approach to take. Until that somehow gets sorted out, I think this isn't going to make any progress. Unfortunately, I don't have a clear idea what sort of solution everybody could tolerate.

I still think that some kind of slow-expire behavior -- like a clock hand that hits each backend every 10 minutes and expires entries not used since the last hit -- is actually pretty sensible. It ensures that idle or long-running backends don't accumulate infinite bloat while still allowing the cache to grow large enough for good performance when all entries are being regularly used. But Tom doesn't like it. Other approaches were also discussed; none of them seem like an obvious slam-dunk.

Turning to the patch itself, I don't know how we decide whether the patch is worth it. Scanning the whole (potentially large) cache to remove negative entries has a cost, mostly in CPU cycles; keeping those negative entries around for a long time also has a cost, mostly in memory. I don't know how to decide whether these patches will help more people than they hurt, or the other way around -- and it's not clear that anyone else has a good idea about that either.

Typos: funciton, paritial.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
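Robert's slow-expire proposal is essentially a clock algorithm: each sweep clears a referenced bit, and an entry whose bit is still clear at the next sweep has been idle for a full interval and gets evicted. A minimal sketch, with the ten-minute hand replaced by an explicit sweep call and all names invented:

```c
#include <assert.h>
#include <stdbool.h>

#define NENTRIES 3

typedef struct CacheEntry
{
	int			key;
	bool		live;
	bool		referenced;		/* touched since the last clock sweep? */
} CacheEntry;

/* Mark an entry used, as a cache hit would. Creation also sets the bit. */
static void
touch(CacheEntry *e)
{
	e->referenced = true;
}

/*
 * The "clock hand": evict whatever was not referenced since the previous
 * sweep, then clear the bits so the next interval starts fresh.
 * Returns the number of entries evicted.
 */
static int
clock_sweep(CacheEntry *entries, int n)
{
	int			evicted = 0;

	for (int i = 0; i < n; i++)
	{
		if (!entries[i].live)
			continue;
		if (!entries[i].referenced)
		{
			entries[i].live = false;	/* idle a full interval: expire */
			evicted++;
		}
		entries[i].referenced = false;
	}
	return evicted;
}
```

The property Robert describes falls out directly: a regularly used entry is never expired no matter how long the backend lives, while an idle backend's entries all drain away within two sweep intervals.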
On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> This patch have had interferences from several commits after the
> last submission. I amended this patch to follow them (up to
> f97c55c), removed an unnecessary branch and edited some comments.

Hi Kyotaro-san,

This applies but several regression tests fail for me. Here is a sample backtrace:

frame #3: 0x000000010f0614c0 postgres`ExceptionalCondition(conditionName="!(attnum < 0 ? attnum == (-2) : cache->cc_tupdesc->attrs[attnum].atttypid == 26)", errorType="FailedAssertion", fileName="catcache.c", lineNumber=1384) + 128 at assert.c:54
frame #4: 0x000000010f03b5fd postgres`CollectOIDsForHashValue(cache=0x00007fe273821268, hashValue=994410284, attnum=0) + 253 at catcache.c:1383
frame #5: 0x000000010f055e8e postgres`SysCacheSysCacheInvalCallback(arg=140610577303984, cacheid=0, hashValue=994410284) + 94 at syscache.c:1692
frame #6: 0x000000010f03fbbb postgres`CallSyscacheCallbacks(cacheid=0, hashvalue=994410284) + 219 at inval.c:1468
frame #7: 0x000000010f03f878 postgres`LocalExecuteInvalidationMessage(msg=0x00007fff51213ff8) + 88 at inval.c:566
frame #8: 0x000000010ee7a3f2 postgres`ReceiveSharedInvalidMessages(invalFunction=(postgres`LocalExecuteInvalidationMessage at inval.c:555), resetFunction=(postgres`InvalidateSystemCaches at inval.c:647)) + 354 at sinval.c:121
frame #9: 0x000000010f03fcb7 postgres`AcceptInvalidationMessages + 23 at inval.c:686
frame #10: 0x000000010eade609 postgres`AtStart_Cache + 9 at xact.c:987
frame #11: 0x000000010ead8c2f postgres`StartTransaction + 655 at xact.c:1921
frame #12: 0x000000010ead8896 postgres`StartTransactionCommand + 70 at xact.c:2691
frame #13: 0x000000010eea9746 postgres`start_xact_command + 22 at postgres.c:2438
frame #14: 0x000000010eea722e postgres`exec_simple_query(query_string="RESET SESSION AUTHORIZATION;") + 126 at postgres.c:913
frame #15: 0x000000010eea68d7 postgres`PostgresMain(argc=1, argv=0x00007fe2738036a8, dbname="regression", username="munro") + 2375 at postgres.c:4090
frame #16: 0x000000010eded40e postgres`BackendRun(port=0x00007fe2716001a0) + 654 at postmaster.c:4357
frame #17: 0x000000010edec793 postgres`BackendStartup(port=0x00007fe2716001a0) + 483 at postmaster.c:4029
frame #18: 0x000000010edeb785 postgres`ServerLoop + 597 at postmaster.c:1753
frame #19: 0x000000010ede8f71 postgres`PostmasterMain(argc=8, argv=0x00007fe271403860) + 5553 at postmaster.c:1361
frame #20: 0x000000010ed0ccd9 postgres`main(argc=8, argv=0x00007fe271403860) + 761 at main.c:228
frame #21: 0x00007fff8333a5ad libdyld.dylib`start + 1

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Thank you for reviewing this.

At Sat, 2 Sep 2017 12:12:47 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=3wqPFFSKP_yhkuHLZtOOwZskGuHJdSctVnbHQ4DFEH+Q@mail.gmail.com>
> On Mon, Aug 28, 2017 at 9:24 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > This patch have had interferences from several commits after the
> > last submission. I amended this patch to follow them (up to
> > f97c55c), removed an unnecessary branch and edited some comments.
>
> Hi Kyotaro-san,
>
> This applies but several regression tests fail for me. Here is a
> sample backtrace:

Sorry for the silly mistake. STAEXTNAMENSP and STATRELATTINH were missing the additional elements in their definitions; somehow I had removed them. The attached patch no longer crashes during the regression tests. I also fixed some typos pointed out by Robert and others I found myself.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
Thank you for the comment.

At Mon, 28 Aug 2017 21:31:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com>
> On Mon, Aug 28, 2017 at 5:24 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > This patch have had interferences from several commits after the
> > last submission. I amended this patch to follow them (up to
> > f97c55c), removed an unnecessary branch and edited some comments.
>
> I think the core problem for this patch is that there's no consensus
> on what approach to take. Until that somehow gets sorted out, I think
> this isn't going to make any progress. Unfortunately, I don't have a
> clear idea what sort of solution everybody could tolerate.
>
> I still think that some kind of slow-expire behavior -- like a clock
> hand that hits each backend every 10 minutes and expires entries not
> used since the last hit -- is actually pretty sensible. It ensures
> that idle or long-running backends don't accumulate infinite bloat
> while still allowing the cache to grow large enough for good
> performance when all entries are being regularly used. But Tom
> doesn't like it. Other approaches were also discussed; none of them
> seem like an obvious slam-dunk.

I suppose that it would slow intermittent lookups of non-existent objects. I have tried a slightly different thing: removing entries by age, preserving a specified number (or ratio to live entries) of younger negative entries. The problem with that approach was that I didn't find a way to determine the number of entries to preserve, and I didn't want to offer additional knobs for it. Finally I proposed the patch upthread, since it doesn't need any assumption about usage.

Though I could make another patch that does the same thing based on LRU, the same how-many-to-preserve problem ought to be resolved in order to avoid slowing the intermittent lookups.

> Turning to the patch itself, I don't know how we decide whether the
> patch is worth it. Scanning the whole (potentially large) cache to
> remove negative entries has a cost, mostly in CPU cycles; keeping
> those negative entries around for a long time also has a cost, mostly
> in memory. I don't know how to decide whether these patches will help
> more people than it hurts, or the other way around -- and it's not
> clear that anyone else has a good idea about that either.

Scanning a hash on invalidation of several catalogs slows (hopefully only slightly) a certain percentage of invalidations on most workloads. Holding no-longer-looked-up entries surely kills a backend under certain workloads sooner or later. This doesn't save the pg_proc cases, but it saves the pg_statistic and pg_class cases. I'm not sure what other catalogs can bloat.

I could reduce the complexity of this. The inval mechanism conveys only a hash value, so this scans the whole of a cache for the target OIDs (with possible spurious targets). That could be resolved by letting the inval mechanism convey an OID (but this may need additional members in an inval entry). Still, the full scan performed in CleanupCatCacheNegEntries doesn't seem easily avoidable. Separating the hash by the OID of the key, or providing a special dlist that points to tuples in buckets, would introduce another complexity.

> Typos: funciton, paritial.

Thanks. ispell told me of additional typos: corresnpond, belive and undistinguisable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI
This is a rebased version of the patch. At Fri, 17 Mar 2017 14:23:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.142313.232290068.horiguchi.kyotaro@lab.ntt.co.jp> > At Tue, 7 Mar 2017 19:23:14 -0800, David Steele <david@pgmasters.net> wrote in <3b7b7f90-db46-8c37-c4f7-443330c3ae33@pgmasters.net> > > On 3/3/17 4:54 PM, David Steele wrote: > > > > > On 2/1/17 1:25 AM, Kyotaro HORIGUCHI wrote: > > >> Hello, thank you for moving this to the next CF. > > >> > > >> At Wed, 1 Feb 2017 13:09:51 +0900, Michael Paquier > > >> <michael.paquier@gmail.com> wrote in > > >> <CAB7nPqRFhUv+GX=eH1bo7xYHS79-gRj1ecu2QoQtHvX9RS=JYA@mail.gmail.com> > > >>> On Tue, Jan 24, 2017 at 4:58 PM, Kyotaro HORIGUCHI > > >>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > >>>> Six new syscaches in 665d1fa was conflicted and 3-way merge > > >>>> worked correctly. The new syscaches don't seem to be targets of > > >>>> this patch. > > >>> To be honest, I am not completely sure what to think about this patch. > > >>> Moved to next CF as there is a new version, and no new reviews to make > > >>> the discussion perhaps move on. > > >> I'm thinking the following is the status of this topic. > > >> > > >> - The patch stll is not getting conflicted. > > >> > > >> - This is not a hollistic measure for memory leak but surely > > >> saves some existing cases. > > >> > > >> - Shared catcache is another discussion (and won't really > > >> proposed in a short time due to the issue on locking.) > > >> > > >> - As I mentioned, a patch that caps the number of negative > > >> entries is avaiable (in first-created - first-delete manner) > > >> but it is having a loose end of how to determine the > > >> limitation. > > > While preventing bloat in the syscache is a worthwhile goal, it > > > appears > > > there are a number of loose ends here and a new patch has not been > > > provided. 
regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 9f2c81dbc9bc344cafd6995dfc5969d55a8457d9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 28 Aug 2017 11:36:21 +0900 Subject: [PATCH 1/2] Cleanup negative cache of pg_statistic when dropping arelation. Accessing columns that don't have statistics leaves negative entries in catcache for pg_statstic, but there's no chance to remove them. Especially when repeatedly creating then dropping temporary tables bloats catcache so much that memory pressure becomes significant. This patch removes negative entries in STATRELATTINH, ATTNAME and ATTNUM when corresponding relation is dropped. ---src/backend/utils/cache/catcache.c | 58 ++++++-src/backend/utils/cache/syscache.c | 302 +++++++++++++++++++++++++++----------src/include/utils/catcache.h | 3 +src/include/utils/syscache.h | 2 +4files changed, 282 insertions(+), 83 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 95a0742..bd303f3 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -423,10 +423,11 @@ CatCachePrintStats(int code, Datum arg) if (cache->cc_ntup == 0 && cache->cc_searches == 0) continue; /* don't print unused caches */ - elog(DEBUG2, "catcache %s/%u: %d tup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ld lsrch, %ldlhits", + elog(DEBUG2, "catcache %s/%u: %d tup, %d negtup, %ld srch, %ld+%ld=%ld hits, %ld+%ld=%ld loads, %ld invals, %ldlsrch, %ld lhits", cache->cc_relname, cache->cc_indexoid, cache->cc_ntup, + cache->cc_nnegtup, cache->cc_searches, cache->cc_hits, cache->cc_neg_hits, @@ -495,8 +496,11 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) * point into tuple, allocated together with theCatCTup. 
 	 */
 	if (ct->negative)
+	{
 		CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
 						 cache->cc_keyno, ct->keys);
+		--cache->cc_nnegtup;
+	}
 
 	pfree(ct);
 
@@ -697,6 +701,51 @@ ResetCatalogCache(CatCache *cache)
 }
 
 /*
+ * CleanupCatCacheNegEntries
+ *
+ * Remove negative cache tuples matching a partial key.
+ */
+void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey)
+{
+	int			i;
+
+	/* If this cache has no negative entries, nothing to do */
+	if (cache->cc_nnegtup == 0)
+		return;
+
+	/* Searching with a partial key means scanning the whole cache */
+	for (i = 0; i < cache->cc_nbuckets; i++)
+	{
+		dlist_head *bucket = &cache->cc_bucket[i];
+		dlist_mutable_iter iter;
+
+		dlist_foreach_modify(iter, bucket)
+		{
+			const CCFastEqualFN *cc_fastequal = cache->cc_fastequal;
+			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+			int			oid_attnum = skey->sk_attno - 1;
+
+			if (!ct->negative)
+				continue;
+
+			/* Compare the OIDs */
+			if (!(cc_fastequal[oid_attnum]) (ct->keys[oid_attnum],
+											 skey[0].sk_argument))
+				continue;
+
+			/*
+			 * A negative cache entry can no longer be referenced, so we can
+			 * remove it unconditionally.
+			 */
+			CatCacheRemoveCTup(cache, ct);
+		}
+	}
+}
+
+
+/*
  * ResetCatalogCaches
  *
  *	Reset all caches when a shared cache inval event forces it
@@ -845,6 +894,7 @@ InitCatCache(int id,
 	cp->cc_relisshared = false; /* temporary */
 	cp->cc_tupdesc = (TupleDesc) NULL;
 	cp->cc_ntup = 0;
+	cp->cc_nnegtup = 0;
 	cp->cc_nbuckets = nbuckets;
 	cp->cc_nkeys = nkeys;
 	for (i = 0; i < nkeys; ++i)
@@ -1420,8 +1470,8 @@ SearchCatCacheMiss(CatCache *cache,
 	CACHE4_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
 				cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
-	CACHE3_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d",
-				cache->cc_relname, hashIndex);
+	CACHE4_elog(DEBUG2, "SearchCatCache(%s): put neg entry in bucket %d, total %d",
+				cache->cc_relname, hashIndex, cache->cc_nnegtup);
 
 	/*
 	 * We are not returning the negative entry to the caller, so
leave its @@ -1906,6 +1956,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, cache->cc_ntup++; CacheHdr->ch_ntup++; + if (negative) + cache->cc_nnegtup++; /* * If the hash table has become too full, enlarge the buckets array. Quite diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 888edbb..753c5f1 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -75,6 +75,8 @@#include "catalog/pg_user_mapping.h"#include "utils/rel.h"#include "utils/catcache.h" +#include "utils/fmgroids.h" +#include "utils/inval.h"#include "utils/syscache.h" @@ -118,6 +120,10 @@ struct cachedesc int nkeys; /* # of keys needed for cache lookup */ int key[4]; /* attribute numbers of key attrs */ int nbuckets; /* number of hashbuckets for this cache */ + + /* relcache invalidation stuff */ + AttrNumber relattrnum; /* attrnum to retrieve reloid for + * invalidation, 0 if not needed */};static const struct cachedesc cacheinfo[] = { @@ -130,7 +136,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 16 + 16, + 0 }, {AccessMethodRelationId, /* AMNAME */ AmNameIndexId, @@ -141,7 +148,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {AccessMethodRelationId, /* AMOID */ AmOidIndexId, @@ -152,7 +160,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {AccessMethodOperatorRelationId, /* AMOPOPID */ AccessMethodOperatorIndexId, @@ -163,7 +172,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_amop_amopfamily, 0 }, - 64 + 64, + 0 }, {AccessMethodOperatorRelationId, /* AMOPSTRATEGY */ AccessMethodStrategyIndexId, @@ -174,7 +184,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_amop_amoprighttype, Anum_pg_amop_amopstrategy }, - 64 + 64, + 0 }, {AccessMethodProcedureRelationId, /* AMPROCNUM */ AccessMethodProcedureIndexId, @@ -185,7 +196,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_amproc_amprocrighttype, 
Anum_pg_amproc_amprocnum }, - 16 + 16, + 0 }, {AttributeRelationId, /* ATTNAME */ AttributeRelidNameIndexId, @@ -196,7 +208,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 32 + 32, + Anum_pg_attribute_attrelid }, {AttributeRelationId, /* ATTNUM */ AttributeRelidNumIndexId, @@ -207,7 +220,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 128 + 128, + Anum_pg_attribute_attrelid }, {AuthMemRelationId, /* AUTHMEMMEMROLE */ AuthMemMemRoleIndexId, @@ -218,7 +232,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {AuthMemRelationId, /* AUTHMEMROLEMEM */ AuthMemRoleMemIndexId, @@ -229,7 +244,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {AuthIdRelationId, /* AUTHNAME */ AuthIdRolnameIndexId, @@ -240,7 +256,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {AuthIdRelationId, /* AUTHOID */ AuthIdOidIndexId, @@ -251,10 +268,10 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, - { - CastRelationId, /* CASTSOURCETARGET */ + {CastRelationId, /* CASTSOURCETARGET */ CastSourceTargetIndexId, 2, { @@ -263,7 +280,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 256 + 256, + 0 }, {OperatorClassRelationId, /* CLAAMNAMENSP */ OpclassAmNameNspIndexId, @@ -274,7 +292,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_opclass_opcnamespace, 0 }, - 8 + 8, + 0 }, {OperatorClassRelationId, /* CLAOID */ OpclassOidIndexId, @@ -285,7 +304,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {CollationRelationId, /* COLLNAMEENCNSP */ CollationNameEncNspIndexId, @@ -296,7 +316,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_collation_collnamespace, 0 }, - 8 + 8, + 0 }, {CollationRelationId, /* COLLOID */ CollationOidIndexId, @@ -307,7 +328,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {ConversionRelationId, /* CONDEFAULT */ ConversionDefaultIndexId, @@ -318,7 +340,8 @@ static const struct 
cachedesc cacheinfo[] = { Anum_pg_conversion_contoencoding, ObjectIdAttributeNumber, }, - 8 + 8, + 0 }, {ConversionRelationId, /* CONNAMENSP */ ConversionNameNspIndexId, @@ -329,7 +352,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {ConstraintRelationId, /* CONSTROID */ ConstraintOidIndexId, @@ -340,7 +364,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 16 + 16, + 0 }, {ConversionRelationId, /* CONVOID */ ConversionOidIndexId, @@ -351,7 +376,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {DatabaseRelationId, /* DATABASEOID */ DatabaseOidIndexId, @@ -362,7 +388,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {DefaultAclRelationId, /* DEFACLROLENSPOBJ */ DefaultAclRoleNspObjIndexId, @@ -373,7 +400,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_default_acl_defaclobjtype, 0 }, - 8 + 8, + 0 }, {EnumRelationId, /* ENUMOID */ EnumOidIndexId, @@ -384,7 +412,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {EnumRelationId, /* ENUMTYPOIDNAME */ EnumTypIdLabelIndexId, @@ -395,7 +424,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {EventTriggerRelationId, /* EVENTTRIGGERNAME */ EventTriggerNameIndexId, @@ -406,7 +436,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {EventTriggerRelationId, /* EVENTTRIGGEROID */ EventTriggerOidIndexId, @@ -417,7 +448,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {ForeignDataWrapperRelationId, /* FOREIGNDATAWRAPPERNAME */ ForeignDataWrapperNameIndexId, @@ -428,7 +460,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {ForeignDataWrapperRelationId, /* FOREIGNDATAWRAPPEROID */ ForeignDataWrapperOidIndexId, @@ -439,7 +472,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {ForeignServerRelationId, /* FOREIGNSERVERNAME */ ForeignServerNameIndexId, @@ -450,7 +484,8 @@ static const struct 
cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {ForeignServerRelationId, /* FOREIGNSERVEROID */ ForeignServerOidIndexId, @@ -461,7 +496,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {ForeignTableRelationId, /* FOREIGNTABLEREL */ ForeignTableRelidIndexId, @@ -472,7 +508,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {IndexRelationId, /* INDEXRELID */ IndexRelidIndexId, @@ -483,7 +520,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {LanguageRelationId, /* LANGNAME */ LanguageNameIndexId, @@ -494,7 +532,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {LanguageRelationId, /* LANGOID */ LanguageOidIndexId, @@ -505,7 +544,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {NamespaceRelationId, /* NAMESPACENAME */ NamespaceNameIndexId, @@ -516,7 +556,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {NamespaceRelationId, /* NAMESPACEOID */ NamespaceOidIndexId, @@ -527,7 +568,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 16 + 16, + 0 }, {OperatorRelationId, /* OPERNAMENSP */ OperatorNameNspIndexId, @@ -538,7 +580,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_operator_oprright, Anum_pg_operator_oprnamespace }, - 256 + 256, + 0 }, {OperatorRelationId, /* OPEROID */ OperatorOidIndexId, @@ -549,7 +592,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 32 + 32, + 0 }, {OperatorFamilyRelationId, /* OPFAMILYAMNAMENSP */ OpfamilyAmNameNspIndexId, @@ -560,7 +604,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_opfamily_opfnamespace, 0 }, - 8 + 8, + 0 }, {OperatorFamilyRelationId, /* OPFAMILYOID */ OpfamilyOidIndexId, @@ -571,7 +616,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {PartitionedRelationId, /* PARTRELID */ PartitionedRelidIndexId, @@ -582,7 +628,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 32 + 32, + 0 }, 
{ProcedureRelationId, /* PROCNAMEARGSNSP */ ProcedureNameArgsNspIndexId, @@ -593,7 +640,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_proc_pronamespace, 0 }, - 128 + 128, + 0 }, {ProcedureRelationId, /* PROCOID */ ProcedureOidIndexId, @@ -604,7 +652,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 128 + 128, + 0 }, {PublicationRelationId, /* PUBLICATIONNAME */ PublicationNameIndexId, @@ -615,7 +664,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {PublicationRelationId, /* PUBLICATIONOID */ PublicationObjectIndexId, @@ -626,7 +676,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {PublicationRelRelationId, /* PUBLICATIONREL */ PublicationRelObjectIndexId, @@ -637,7 +688,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {PublicationRelRelationId, /* PUBLICATIONRELMAP */ PublicationRelPrrelidPrpubidIndexId, @@ -648,7 +700,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {RangeRelationId, /* RANGETYPE */ RangeTypidIndexId, @@ -659,7 +712,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {RelationRelationId, /* RELNAMENSP */ ClassNameNspIndexId, @@ -670,7 +724,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 128 + 128, + 0 }, {RelationRelationId, /* RELOID */ ClassOidIndexId, @@ -681,7 +736,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 128 + 128, + 0 }, {ReplicationOriginRelationId, /* REPLORIGIDENT */ ReplicationOriginIdentIndex, @@ -692,7 +748,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 16 + 16, + 0 }, {ReplicationOriginRelationId, /* REPLORIGNAME */ ReplicationOriginNameIndex, @@ -703,7 +760,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 16 + 16, + 0 }, {RewriteRelationId, /* RULERELNAME */ RewriteRelRulenameIndexId, @@ -714,7 +772,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 8 + 8, + 0 }, {SequenceRelationId, /* SEQRELID */ 
SequenceRelidIndexId, @@ -725,7 +784,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 32 + 32, + 0 }, {StatisticExtRelationId, /* STATEXTNAMENSP */ StatisticExtNameIndexId, @@ -736,7 +796,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {StatisticExtRelationId, /* STATEXTOID */ StatisticExtOidIndexId, @@ -747,7 +808,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {StatisticRelationId, /* STATRELATTINH */ StatisticRelidAttnumInhIndexId, @@ -758,7 +820,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_statistic_stainherit, 0 }, - 128 + 128, + Anum_pg_statistic_starelid }, {SubscriptionRelationId, /* SUBSCRIPTIONNAME */ SubscriptionNameIndexId, @@ -769,7 +832,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {SubscriptionRelationId, /* SUBSCRIPTIONOID */ SubscriptionObjectIndexId, @@ -780,7 +844,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 4 + 4, + 0 }, {SubscriptionRelRelationId, /* SUBSCRIPTIONRELMAP */ SubscriptionRelSrrelidSrsubidIndexId, @@ -791,7 +856,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {TableSpaceRelationId, /* TABLESPACEOID */ TablespaceOidIndexId, @@ -802,7 +868,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0, }, - 4 + 4, + 0 }, {TransformRelationId, /* TRFOID */ TransformOidIndexId, @@ -813,7 +880,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0, }, - 16 + 16, + 0 }, {TransformRelationId, /* TRFTYPELANG */ TransformTypeLangIndexId, @@ -824,7 +892,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0, }, - 16 + 16, + 0 }, {TSConfigMapRelationId, /* TSCONFIGMAP */ TSConfigMapIndexId, @@ -835,7 +904,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_ts_config_map_mapseqno, 0 }, - 2 + 2, + 0 }, {TSConfigRelationId, /* TSCONFIGNAMENSP */ TSConfigNameNspIndexId, @@ -846,7 +916,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSConfigRelationId, /* TSCONFIGOID 
*/ TSConfigOidIndexId, @@ -857,7 +928,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSDictionaryRelationId, /* TSDICTNAMENSP */ TSDictionaryNameNspIndexId, @@ -868,7 +940,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSDictionaryRelationId, /* TSDICTOID */ TSDictionaryOidIndexId, @@ -879,7 +952,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSParserRelationId, /* TSPARSERNAMENSP */ TSParserNameNspIndexId, @@ -890,7 +964,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSParserRelationId, /* TSPARSEROID */ TSParserOidIndexId, @@ -901,7 +976,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSTemplateRelationId, /* TSTEMPLATENAMENSP */ TSTemplateNameNspIndexId, @@ -912,7 +988,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TSTemplateRelationId, /* TSTEMPLATEOID */ TSTemplateOidIndexId, @@ -923,7 +1000,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {TypeRelationId, /* TYPENAMENSP */ TypeNameNspIndexId, @@ -934,7 +1012,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {TypeRelationId, /* TYPEOID */ TypeOidIndexId, @@ -945,7 +1024,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 64 + 64, + 0 }, {UserMappingRelationId, /* USERMAPPINGOID */ UserMappingOidIndexId, @@ -956,7 +1036,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }, {UserMappingRelationId, /* USERMAPPINGUSERSERVER */ UserMappingUserServerIndexId, @@ -967,7 +1048,8 @@ static const struct cachedesc cacheinfo[] = { 0, 0 }, - 2 + 2, + 0 }}; @@ -983,8 +1065,23 @@ static int SysCacheRelationOidSize;static Oid SysCacheSupportingRelOid[SysCacheSize * 2];staticint SysCacheSupportingRelOidSize; -static int oid_compare(const void *a, const void *b); +/* + * stuff for negative cache flushing by relcache invalidation + */ +#define MAX_RELINVAL_CALLBACKS 4 +typedef 
struct RELINVALCBParam
+{
+	CatCache   *cache;
+	int			relkeynum;
+} RELINVALCBParam;
+
+RELINVALCBParam relinval_callback_list[MAX_RELINVAL_CALLBACKS];
+static int	relinval_callback_count = 0;
+
+static ScanKeyData oideqscankey;	/* ScanKey for reloid match */
+
+static int	oid_compare(const void *a, const void *b);
+static void SysCacheRelInvalCallback(Datum arg, Oid reloid);
 
 /*
  * InitCatalogCache - initialize the caches
@@ -1028,6 +1125,21 @@ InitCatalogCache(void)
 			cacheinfo[cacheId].indoid;
 		/* see comments for RelationInvalidatesSnapshotsOnly */
 		Assert(!RelationInvalidatesSnapshotsOnly(cacheinfo[cacheId].reloid));
+
+		/*
+		 * If this syscache requests relcache invalidation, register a
+		 * callback.
+		 */
+		if (cacheinfo[cacheId].relattrnum > 0)
+		{
+			Assert(relinval_callback_count < MAX_RELINVAL_CALLBACKS);
+
+			relinval_callback_list[relinval_callback_count].cache =
+				SysCache[cacheId];
+			relinval_callback_list[relinval_callback_count].relkeynum =
+				cacheinfo[cacheId].relattrnum;
+			relinval_callback_count++;
+		}
 	}
 
 	Assert(SysCacheRelationOidSize <= lengthof(SysCacheRelationOid));
@@ -1052,10 +1164,40 @@ InitCatalogCache(void)
 	}
 	SysCacheSupportingRelOidSize = j + 1;
 
+	/*
+	 * Prepare the scankey for reloid comparison and register a relcache
+	 * invalidation callback.
+	 */
+	oideqscankey.sk_strategy = BTEqualStrategyNumber;
+	oideqscankey.sk_subtype = InvalidOid;
+	oideqscankey.sk_collation = InvalidOid;
+	fmgr_info_cxt(F_OIDEQ, &oideqscankey.sk_func, CacheMemoryContext);
+	CacheRegisterRelcacheCallback(SysCacheRelInvalCallback, (Datum) 0);
+
 	CacheInitialized = true;
 }
 
 /*
+ * Callback function for negative cache flushing by relcache invalidation.
+ * The scankey for this function has been prepared in InitCatalogCache.
+ */
+static void
+SysCacheRelInvalCallback(Datum arg, Oid reloid)
+{
+	int			i;
+
+	for (i = 0; i < relinval_callback_count; i++)
+	{
+		ScanKeyData skey;
+
+		memcpy(&skey, &oideqscankey, sizeof(skey));
+		skey.sk_attno = relinval_callback_list[i].relkeynum;
+		skey.sk_argument = ObjectIdGetDatum(reloid);
+		CleanupCatCacheNegEntries(relinval_callback_list[i].cache, &skey);
+	}
+}
+
+/*
  * InitCatalogCachePhase2 - finish initializing the caches
  *
  *	Finish initializing all the caches, including necessary database
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 74535eb..7564f42 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -59,6 +59,7 @@ typedef struct catcache
 	Oid			cc_indexoid;	/* OID of index matching cache keys */
 	bool		cc_relisshared; /* is relation shared across databases? */
 	slist_node	cc_next;		/* list link */
+	int			cc_nnegtup;		/* # of negative tuples */
 	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 											 * scans */
@@ -217,6 +218,8 @@ extern CatCList *SearchCatCacheList(CatCache *cache, int nkeys,
 									   Datum v3, Datum v4);
 extern void ReleaseCatCacheList(CatCList *list);
+extern void
+CleanupCatCacheNegEntries(CatCache *cache, ScanKeyData *skey);
 extern void ResetCatalogCaches(void);
 extern void CatalogCacheFlushCatalog(Oid catId);
 extern void CatCacheInvalidate(CatCache *cache, uint32 hashValue);
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 8a0be41..26ac57c 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -132,6 +132,8 @@ extern HeapTuple SearchSysCache4(int cacheId,
 								 Datum key1, Datum key2, Datum key3, Datum key4);
 extern void ReleaseSysCache(HeapTuple tuple);
+extern void CleanupNegativeCache(int cacheid, int nkeys,
+								 Datum key1, Datum key2, Datum key3, Datum key4);
 
 /* convenience routines */
 extern HeapTuple SearchSysCacheCopy(int cacheId,
-- 
2.9.2

From 56b1eede29631df78cc622386693381b7aa76a51 Mon Sep 17 00:00:00 2001
From: Kyotaro
Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 28 Aug 2017 12:18:17 +0900
Subject: [PATCH 2/2] Cleanup negative cache of pg_class when dropping a
 schema

This flush is in turn triggered by catcache invalidation. This patch
provides a syscache invalidation callback that flushes negative cache
entries corresponding to the invalidated objects.
---
 src/backend/utils/cache/catcache.c |  42 +++++
 src/backend/utils/cache/inval.c    |   7 +-
 src/backend/utils/cache/syscache.c | 327 ++++++++++++++++++++++++++++---------
 src/include/utils/catcache.h       |   3 +
 4 files changed, 300 insertions(+), 79 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index bd303f3..a9ef028 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1555,6 +1555,48 @@ GetCatCacheHashValue(CatCache *cache,
 	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
+/*
+ * CollectOIDsForHashValue
+ *
+ * Collect the OIDs corresponding to a hash value. attnum is the column to
+ * retrieve the OIDs from.
+ */
+List *
+CollectOIDsForHashValue(CatCache *cache, uint32 hashValue, int attnum)
+{
+	Index		hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+	dlist_head *bucket = &cache->cc_bucket[hashIndex];
+	dlist_iter	iter;
+	List	   *ret = NIL;
+
+	/* Nothing to return before initialization */
+	if (cache->cc_tupdesc == NULL)
+		return ret;
+
+	/* Currently only an OID key is supported */
+	Assert(attnum <= cache->cc_tupdesc->natts);
+	Assert(attnum < 0 ?
+		   attnum == ObjectIdAttributeNumber :
+		   cache->cc_tupdesc->attrs[attnum].atttypid == OIDOID);
+
+	dlist_foreach(iter, bucket)
+	{
+		CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+		bool		isNull;
+		Datum		oid;
+
+		if (ct->dead)
+			continue;			/* ignore dead entries */
+
+		if (ct->hash_value != hashValue)
+			continue;			/* quickly skip entry if wrong hash val */
+
+		oid = heap_getattr(&ct->tuple, attnum, cache->cc_tupdesc, &isNull);
+		if (!isNull)
+			ret = lappend_oid(ret, DatumGetObjectId(oid));
+	}
+
+	return ret;
+}
 
 /*
  * SearchCatCacheList
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 0e61b4b..86e6f07 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -559,9 +559,14 @@ LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg)
 		{
 			InvalidateCatalogSnapshot();
 
+			/*
+			 * Call the callbacks first so that they can still access the
+			 * entries corresponding to the hashValue.
+			 */
+			CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
+
 			SysCacheInvalidate(msg->cc.id, msg->cc.hashValue);
-
-			CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
 		}
 	}
 	else if (msg->id == SHAREDINVALCATALOG_ID)
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 753c5f1..7dd61cd 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -111,6 +111,16 @@
  */
 
 /*
+ * struct for flushing negative cache entries by syscache invalidation
+ */
+typedef struct SysCacheCBParam_T
+{
+	int			trig_attnum;
+	int			target_cacheid;
+	ScanKeyData skey;
+} SysCacheCBParam;
+
+/*
  * struct cachedesc: information defining a single syscache
  */
 struct cachedesc
@@ -124,6 +134,14 @@ struct cachedesc
 	/* relcache invalidation stuff */
 	AttrNumber	relattrnum;		/* attrnum to retrieve reloid for
 								 * invalidation, 0 if not needed */
+
+	/* catcache invalidation stuff */
+	int			trig_cacheid;	/* cache id of triggering syscache: -1 means
+								 * no triggering cache */
+	int16		trig_attnum;	/* key
column in triggering cache. Must be an + * OID */ + int16 target_attnum; /* corresponding column in this cache. Must be + * an OID*/};static const struct cachedesc cacheinfo[] = { @@ -137,7 +155,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 16, - 0 + 0, + -1, 0, 0 }, {AccessMethodRelationId, /* AMNAME */ AmNameIndexId, @@ -149,7 +168,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {AccessMethodRelationId, /* AMOID */ AmOidIndexId, @@ -161,7 +181,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {AccessMethodOperatorRelationId, /* AMOPOPID */ AccessMethodOperatorIndexId, @@ -173,7 +194,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {AccessMethodOperatorRelationId, /* AMOPSTRATEGY */ AccessMethodStrategyIndexId, @@ -185,7 +207,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_amop_amopstrategy }, 64, - 0 + 0, + -1, 0, 0 }, {AccessMethodProcedureRelationId, /* AMPROCNUM */ AccessMethodProcedureIndexId, @@ -197,7 +220,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_amproc_amprocnum }, 16, - 0 + 0, + -1, 0, 0 }, {AttributeRelationId, /* ATTNAME */ AttributeRelidNameIndexId, @@ -209,7 +233,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 32, - Anum_pg_attribute_attrelid + Anum_pg_attribute_attrelid, + -1, 0, 0 }, {AttributeRelationId, /* ATTNUM */ AttributeRelidNumIndexId, @@ -221,7 +246,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - Anum_pg_attribute_attrelid + Anum_pg_attribute_attrelid, + -1, 0, 0 }, {AuthMemRelationId, /* AUTHMEMMEMROLE */ AuthMemMemRoleIndexId, @@ -233,7 +259,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {AuthMemRelationId, /* AUTHMEMROLEMEM */ AuthMemRoleMemIndexId, @@ -245,7 +272,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {AuthIdRelationId, /* AUTHNAME */ AuthIdRolnameIndexId, @@ -257,7 +285,8 @@ static const struct 
cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {AuthIdRelationId, /* AUTHOID */ AuthIdOidIndexId, @@ -269,7 +298,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {CastRelationId, /* CASTSOURCETARGET */ CastSourceTargetIndexId, @@ -281,7 +311,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 256, - 0 + 0, + -1, 0, 0 }, {OperatorClassRelationId, /* CLAAMNAMENSP */ OpclassAmNameNspIndexId, @@ -293,7 +324,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {OperatorClassRelationId, /* CLAOID */ OpclassOidIndexId, @@ -305,7 +337,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {CollationRelationId, /* COLLNAMEENCNSP */ CollationNameEncNspIndexId, @@ -317,7 +350,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {CollationRelationId, /* COLLOID */ CollationOidIndexId, @@ -329,7 +363,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {ConversionRelationId, /* CONDEFAULT */ ConversionDefaultIndexId, @@ -341,7 +376,8 @@ static const struct cachedesc cacheinfo[] = { ObjectIdAttributeNumber, }, 8, - 0 + 0, + -1, 0, 0 }, {ConversionRelationId, /* CONNAMENSP */ ConversionNameNspIndexId, @@ -353,7 +389,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {ConstraintRelationId, /* CONSTROID */ ConstraintOidIndexId, @@ -365,7 +402,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 16, - 0 + 0, + -1, 0, 0 }, {ConversionRelationId, /* CONVOID */ ConversionOidIndexId, @@ -377,7 +415,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {DatabaseRelationId, /* DATABASEOID */ DatabaseOidIndexId, @@ -389,7 +428,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {DefaultAclRelationId, /* DEFACLROLENSPOBJ */ DefaultAclRoleNspObjIndexId, @@ -401,7 +441,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 
0 }, {EnumRelationId, /* ENUMOID */ EnumOidIndexId, @@ -413,7 +454,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {EnumRelationId, /* ENUMTYPOIDNAME */ EnumTypIdLabelIndexId, @@ -425,7 +467,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {EventTriggerRelationId, /* EVENTTRIGGERNAME */ EventTriggerNameIndexId, @@ -437,7 +480,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {EventTriggerRelationId, /* EVENTTRIGGEROID */ EventTriggerOidIndexId, @@ -449,7 +493,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {ForeignDataWrapperRelationId, /* FOREIGNDATAWRAPPERNAME */ ForeignDataWrapperNameIndexId, @@ -461,7 +506,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {ForeignDataWrapperRelationId, /* FOREIGNDATAWRAPPEROID */ ForeignDataWrapperOidIndexId, @@ -473,7 +519,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {ForeignServerRelationId, /* FOREIGNSERVERNAME */ ForeignServerNameIndexId, @@ -485,7 +532,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {ForeignServerRelationId, /* FOREIGNSERVEROID */ ForeignServerOidIndexId, @@ -497,7 +545,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {ForeignTableRelationId, /* FOREIGNTABLEREL */ ForeignTableRelidIndexId, @@ -509,7 +558,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {IndexRelationId, /* INDEXRELID */ IndexRelidIndexId, @@ -521,7 +571,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {LanguageRelationId, /* LANGNAME */ LanguageNameIndexId, @@ -533,7 +584,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {LanguageRelationId, /* LANGOID */ LanguageOidIndexId, @@ -545,7 +597,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, 
{NamespaceRelationId, /* NAMESPACENAME */ NamespaceNameIndexId, @@ -557,7 +610,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {NamespaceRelationId, /* NAMESPACEOID */ NamespaceOidIndexId, @@ -569,7 +623,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 16, - 0 + 0, + -1, 0, 0 }, {OperatorRelationId, /* OPERNAMENSP */ OperatorNameNspIndexId, @@ -581,7 +636,8 @@ static const struct cachedesc cacheinfo[] = { Anum_pg_operator_oprnamespace }, 256, - 0 + 0, + -1, 0, 0 }, {OperatorRelationId, /* OPEROID */ OperatorOidIndexId, @@ -593,7 +649,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 32, - 0 + 0, + -1, 0, 0 }, {OperatorFamilyRelationId, /* OPFAMILYAMNAMENSP */ OpfamilyAmNameNspIndexId, @@ -605,7 +662,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {OperatorFamilyRelationId, /* OPFAMILYOID */ OpfamilyOidIndexId, @@ -617,7 +675,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {PartitionedRelationId, /* PARTRELID */ PartitionedRelidIndexId, @@ -629,7 +688,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 32, - 0 + 0, + -1, 0, 0 }, {ProcedureRelationId, /* PROCNAMEARGSNSP */ ProcedureNameArgsNspIndexId, @@ -641,7 +701,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - 0 + 0, + -1, 0, 0 }, {ProcedureRelationId, /* PROCOID */ ProcedureOidIndexId, @@ -653,7 +714,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - 0 + 0, + -1, 0, 0 }, {PublicationRelationId, /* PUBLICATIONNAME */ PublicationNameIndexId, @@ -665,7 +727,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {PublicationRelationId, /* PUBLICATIONOID */ PublicationObjectIndexId, @@ -677,7 +740,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {PublicationRelRelationId, /* PUBLICATIONREL */ PublicationRelObjectIndexId, @@ -689,7 +753,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 
}, {PublicationRelRelationId, /* PUBLICATIONRELMAP */ PublicationRelPrrelidPrpubidIndexId, @@ -701,7 +766,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {RangeRelationId, /* RANGETYPE */ RangeTypidIndexId, @@ -713,7 +779,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {RelationRelationId, /* RELNAMENSP */ ClassNameNspIndexId, @@ -725,7 +792,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - 0 + 0, + NAMESPACEOID, ObjectIdAttributeNumber, Anum_pg_class_relnamespace }, {RelationRelationId, /* RELOID*/ ClassOidIndexId, @@ -737,7 +805,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - 0 + 0, + -1, 0, 0 }, {ReplicationOriginRelationId, /* REPLORIGIDENT */ ReplicationOriginIdentIndex, @@ -749,7 +818,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 16, - 0 + 0, + -1, 0, 0 }, {ReplicationOriginRelationId, /* REPLORIGNAME */ ReplicationOriginNameIndex, @@ -761,7 +831,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 16, - 0 + 0, + -1, 0, 0 }, {RewriteRelationId, /* RULERELNAME */ RewriteRelRulenameIndexId, @@ -773,7 +844,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 8, - 0 + 0, + -1, 0, 0 }, {SequenceRelationId, /* SEQRELID */ SequenceRelidIndexId, @@ -785,7 +857,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 32, - 0 + 0, + -1, 0, 0 }, {StatisticExtRelationId, /* STATEXTNAMENSP */ StatisticExtNameIndexId, @@ -797,7 +870,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {StatisticExtRelationId, /* STATEXTOID */ StatisticExtOidIndexId, @@ -809,7 +883,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {StatisticRelationId, /* STATRELATTINH */ StatisticRelidAttnumInhIndexId, @@ -821,7 +896,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 128, - Anum_pg_statistic_starelid + Anum_pg_statistic_starelid, + -1, 0, 0 }, {SubscriptionRelationId, /* SUBSCRIPTIONNAME */ SubscriptionNameIndexId, 
@@ -833,7 +909,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {SubscriptionRelationId, /* SUBSCRIPTIONOID */ SubscriptionObjectIndexId, @@ -845,7 +922,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 4, - 0 + 0, + -1, 0, 0 }, {SubscriptionRelRelationId, /* SUBSCRIPTIONRELMAP */ SubscriptionRelSrrelidSrsubidIndexId, @@ -857,7 +935,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {TableSpaceRelationId, /* TABLESPACEOID */ TablespaceOidIndexId, @@ -869,7 +948,8 @@ static const struct cachedesc cacheinfo[] = { 0, }, 4, - 0 + 0, + -1, 0, 0 }, {TransformRelationId, /* TRFOID */ TransformOidIndexId, @@ -881,7 +961,8 @@ static const struct cachedesc cacheinfo[] = { 0, }, 16, - 0 + 0, + -1, 0, 0 }, {TransformRelationId, /* TRFTYPELANG */ TransformTypeLangIndexId, @@ -893,7 +974,8 @@ static const struct cachedesc cacheinfo[] = { 0, }, 16, - 0 + 0, + -1, 0, 0 }, {TSConfigMapRelationId, /* TSCONFIGMAP */ TSConfigMapIndexId, @@ -905,7 +987,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSConfigRelationId, /* TSCONFIGNAMENSP */ TSConfigNameNspIndexId, @@ -917,7 +1000,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSConfigRelationId, /* TSCONFIGOID */ TSConfigOidIndexId, @@ -929,7 +1013,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSDictionaryRelationId, /* TSDICTNAMENSP */ TSDictionaryNameNspIndexId, @@ -941,7 +1026,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSDictionaryRelationId, /* TSDICTOID */ TSDictionaryOidIndexId, @@ -953,7 +1039,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSParserRelationId, /* TSPARSERNAMENSP */ TSParserNameNspIndexId, @@ -965,7 +1052,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSParserRelationId, /* TSPARSEROID */ TSParserOidIndexId, @@ -977,7 +1065,8 @@ 
static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSTemplateRelationId, /* TSTEMPLATENAMENSP */ TSTemplateNameNspIndexId, @@ -989,7 +1078,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TSTemplateRelationId, /* TSTEMPLATEOID */ TSTemplateOidIndexId, @@ -1001,7 +1091,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {TypeRelationId, /* TYPENAMENSP */ TypeNameNspIndexId, @@ -1013,7 +1104,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {TypeRelationId, /* TYPEOID */ TypeOidIndexId, @@ -1025,7 +1117,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 64, - 0 + 0, + -1, 0, 0 }, {UserMappingRelationId, /* USERMAPPINGOID */ UserMappingOidIndexId, @@ -1037,7 +1130,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }, {UserMappingRelationId, /* USERMAPPINGUSERSERVER */ UserMappingUserServerIndexId, @@ -1049,7 +1143,8 @@ static const struct cachedesc cacheinfo[] = { 0 }, 2, - 0 + 0, + -1, 0, 0 }}; @@ -1082,7 +1177,8 @@ static ScanKeyData oideqscankey; /* ScanKey for reloid match */static int oid_compare(constvoid *a, const void *b);static void SysCacheRelInvalCallback(Datum arg, Oid reloid); - +static void SysCacheSysCacheInvalCallback(Datum arg, int cacheid, + uint32 hashvalue);/* * InitCatalogCache - initialize the caches * @@ -1140,6 +1236,34 @@ InitCatalogCache(void) cacheinfo[cacheId].relattrnum; relinval_callback_count++; } + + /* + * If this syscache has syscache invalidation trigger, register + * it. + */ + if (cacheinfo[cacheId].trig_cacheid >= 0) + { + SysCacheCBParam *param; + + param = MemoryContextAlloc(CacheMemoryContext, + sizeof(SysCacheCBParam)); + param->target_cacheid = cacheId; + + /* + * XXXX: Create a scankeydata for OID comparison. We don't have a + * means to check the type of the column in the system catalog at + * this time. So we have to believe the definition. 
+			 */
+			fmgr_info_cxt(F_OIDEQ, &param->skey.sk_func, CacheMemoryContext);
+			param->skey.sk_attno = cacheinfo[cacheId].target_attnum;
+			param->trig_attnum = cacheinfo[cacheId].trig_attnum;
+			param->skey.sk_strategy = BTEqualStrategyNumber;
+			param->skey.sk_subtype = InvalidOid;
+			param->skey.sk_collation = InvalidOid;
+			CacheRegisterSyscacheCallback(cacheinfo[cacheId].trig_cacheid,
+										  SysCacheSysCacheInvalCallback,
+										  PointerGetDatum(param));
+		}
 	}
 
 	Assert(SysCacheRelationOidSize <= lengthof(SysCacheRelationOid));
@@ -1623,6 +1747,53 @@ RelationInvalidatesSnapshotsOnly(Oid relid)
 }
 
 /*
+ * SysCacheSysCacheInvalCallback
+ *
+ * Callback function for negative cache flushing by syscache invalidation.
+ * Fetches an OID (not restricted to the system oid column) from the
+ * invalidated tuple and flushes negative entries that match the OID in the
+ * target syscache.
+ */
+static void
+SysCacheSysCacheInvalCallback(Datum arg, int cacheid, uint32 hashValue)
+{
+	SysCacheCBParam *param;
+	CatCache   *trigger_cache;	/* triggering catcache */
+	CatCache   *target_cache;	/* target catcache */
+	List	   *oids;
+	ListCell   *lc;
+	int			trigger_cacheid = cacheid;
+	int			target_cacheid;
+
+	param = (SysCacheCBParam *) DatumGetPointer(arg);
+	target_cacheid = param->target_cacheid;
+
+	trigger_cache = SysCache[trigger_cacheid];
+	target_cache = SysCache[target_cacheid];
+
+	/*
+	 * Collect candidate OIDs for target syscache entries.  The result
+	 * contains just one value in most cases, or two or more when the hash
+	 * value has synonyms.  At least one of them is the right OID, but it is
+	 * indistinguishable from the others given only the hash value.  As a
+	 * result some unnecessary entries may be flushed, but that is less
+	 * harmful than letting them bloat the catcaches.
+	 */
+	oids = CollectOIDsForHashValue(trigger_cache, hashValue,
+								   param->trig_attnum);
+
+	foreach(lc, oids)
+	{
+		ScanKeyData skey;
+		Oid			oid = lfirst_oid(lc);
+
+		memcpy(&skey, &param->skey, sizeof(skey));
+		skey.sk_argument = ObjectIdGetDatum(oid);
+		CleanupCatCacheNegEntries(target_cache, &skey);
+	}
+}
+
+/*
  * Test whether a relation has a system cache.
  */
 bool
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 7564f42..562810f 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -213,6 +213,9 @@ extern uint32 GetCatCacheHashValue(CatCache *cache,
 					  Datum v1, Datum v2, Datum v3, Datum v4);
 
+extern List *CollectOIDsForHashValue(CatCache *cache,
+						uint32 hashValue, int attnum);
+
 extern CatCList *SearchCatCacheList(CatCache *cache, int nkeys,
 					Datum v1, Datum v2, Datum v3, Datum v4);
-- 
2.9.2

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > This is a rebased version of the patch. As far as I can see, the patch still applies, compiles, and got no reviews. So moved to next CF. -- Michael
On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> This is a rebased version of the patch. > > As far as I can see, the patch still applies, compiles, and got no > reviews. So moved to next CF. I think we have to mark this as returned with feedback or rejected for the reasons mentioned here: http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 8:25 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Oct 31, 2017 at 6:46 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> This is a rebased version of the patch.
>>
>> As far as I can see, the patch still applies, compiles, and got no
>> reviews. So moved to next CF.
>
> I think we have to mark this as returned with feedback or rejected for
> the reasons mentioned here:
>
> http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com

Good point. I forgot this bit. Thanks for mentioning it; I am marking the patch as returned with feedback.
-- 
Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Thu, Nov 30, 2017 at 12:32 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I think we have to mark this as returned with feedback or rejected for >> the reasons mentioned here: >> http://postgr.es/m/CA+TgmoZjn28uYJRQ2K+5idhYxWBDER68sctoc2p_nW7h7JbhYw@mail.gmail.com > Good point. I forgot this bit. Thanks for mentioning it I am switching > the patch as returned with feedback. We had a bug report just today that seemed to me to trace to relcache bloat: https://www.postgresql.org/message-id/flat/20171129100649.1473.73990%40wrigleys.postgresql.org ISTM that there's definitely work to be done here, but as I said upthread, I think we need a more holistic approach than just focusing on negative catcache entries, or even just catcache entries. The thing that makes me uncomfortable about this is that we used to have a catcache size limitation mechanism, and ripped it out because it had too much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that problem within a fresh implementation. regards, tom lane
On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The thing that makes me uncomfortable about this is that we used to have a > catcache size limitation mechanism, and ripped it out because it had too > much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that > problem within a fresh implementation. At the risk of beating a dead horse, I still think that the amount of wall clock time that has elapsed since an entry was last accessed is very relevant. The problem with a fixed maximum size is that you can hit it arbitrarily frequently; time-based expiration solves that problem. It allows backends that are actively using a lot of stuff to hold on to as many cache entries as they need, while forcing backends that have moved on to a different set of tables -- or that are completely idle -- to let go of cache entries that are no longer being actively used. I think that's what we want. Nobody wants to keep the cache size small when a big cache is necessary for good performance, but what people do want to avoid is having long-running backends eventually accumulate huge numbers of cache entries most of which haven't been touched in hours or, maybe, weeks. To put that another way, we should only hang on to a cache entry for so long as the bytes of memory that it consumes are more valuable than some other possible use of those bytes of memory. That is very likely to be true when we've accessed those bytes recently, but progressively less likely to be true the more time has passed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 30, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Nov 29, 2017 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The thing that makes me uncomfortable about this is that we used to have a
>>> catcache size limitation mechanism, and ripped it out because it had too
>>> much overhead (see commit 8b9bc234a). I'm not sure how we can avoid that
>>> problem within a fresh implementation.
>
>> At the risk of beating a dead horse, I still think that the amount of
>> wall clock time that has elapsed since an entry was last accessed is
>> very relevant.
>
> While I don't object to that statement, I'm not sure how it helps us
> here. If we couldn't afford DLMoveToFront(), doing a gettimeofday()
> during each syscache access is surely right out.

Well, yeah, that would be insane. But I think even something very rough could work well enough. I think our goal should be to eliminate cache entries that have gone unused for many *minutes*, and there's no urgency about getting it to any sort of exact value. For non-idle backends, using the most recent statement start time as a proxy would probably be plenty good enough. Idle backends might need a bit more thought.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-12-01 16:20:44 -0500, Robert Haas wrote: > Well, yeah, that would be insane. But I think even something very > rough could work well enough. I think our goal should be to eliminate > cache entries that are have gone unused for many *minutes*, and > there's no urgency about getting it to any sort of exact value. For > non-idle backends, using the most recent statement start time as a > proxy would probably be plenty good enough. Idle backends might need > a bit more thought. Our timer framework is flexible enough that we can install a once-a-minute timer without much overhead. That timer could increment a 'cache generation' integer. Upon cache access we write the current generation into relcache / syscache (and potentially also plancache?) entries. Not entirely free, but cheap enough. In those once-a-minute passes entries that haven't been touched in X cycles get pruned. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-12-01 16:20:44 -0500, Robert Haas wrote: >> Well, yeah, that would be insane. But I think even something very >> rough could work well enough. I think our goal should be to eliminate >> cache entries that are have gone unused for many *minutes*, and >> there's no urgency about getting it to any sort of exact value. For >> non-idle backends, using the most recent statement start time as a >> proxy would probably be plenty good enough. Idle backends might need >> a bit more thought. > Our timer framework is flexible enough that we can install a > once-a-minute timer without much overhead. That timer could increment a > 'cache generation' integer. Upon cache access we write the current > generation into relcache / syscache (and potentially also plancache?) > entries. Not entirely free, but cheap enough. In those once-a-minute > passes entries that haven't been touched in X cycles get pruned. I have no faith in either of these proposals, because they both assume that the problem only arises over the course of many minutes. In the recent complaint about pg_dump causing relcache bloat, it probably does not take nearly that long for the bloat to occur. Maybe you could make it work on the basis of number of cache accesses, or some other normalized-to-workload-not-wall-clock time reference. regards, tom lane
On 2017-12-01 16:40:23 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-12-01 16:20:44 -0500, Robert Haas wrote: > >> Well, yeah, that would be insane. But I think even something very > >> rough could work well enough. I think our goal should be to eliminate > >> cache entries that are have gone unused for many *minutes*, and > >> there's no urgency about getting it to any sort of exact value. For > >> non-idle backends, using the most recent statement start time as a > >> proxy would probably be plenty good enough. Idle backends might need > >> a bit more thought. > > > Our timer framework is flexible enough that we can install a > > once-a-minute timer without much overhead. That timer could increment a > > 'cache generation' integer. Upon cache access we write the current > > generation into relcache / syscache (and potentially also plancache?) > > entries. Not entirely free, but cheap enough. In those once-a-minute > > passes entries that haven't been touched in X cycles get pruned. > > I have no faith in either of these proposals, because they both assume > that the problem only arises over the course of many minutes. In the > recent complaint about pg_dump causing relcache bloat, it probably does > not take nearly that long for the bloat to occur. To me that's a bit of a different problem than what I was discussing here. It also actually doesn't seem that hard - if your caches are growing fast, you'll continually get hash-resizing of the various. Adding cache-pruning to the resizing code doesn't seem hard, and wouldn't add meaningful overhead. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-12-01 16:40:23 -0500, Tom Lane wrote: >> I have no faith in either of these proposals, because they both assume >> that the problem only arises over the course of many minutes. In the >> recent complaint about pg_dump causing relcache bloat, it probably does >> not take nearly that long for the bloat to occur. > To me that's a bit of a different problem than what I was discussing > here. It also actually doesn't seem that hard - if your caches are > growing fast, you'll continually get hash-resizing of the > various. Adding cache-pruning to the resizing code doesn't seem hard, > and wouldn't add meaningful overhead. That's an interesting way to think about it, as well, though I'm not sure it's quite that simple. If you tie this to cache resizing then the cache will have to grow up to the newly increased size before you'll prune it again. That doesn't sound like it will lead to nice steady-state behavior. regards, tom lane
On 2017-12-01 17:03:28 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-12-01 16:40:23 -0500, Tom Lane wrote: > >> I have no faith in either of these proposals, because they both assume > >> that the problem only arises over the course of many minutes. In the > >> recent complaint about pg_dump causing relcache bloat, it probably does > >> not take nearly that long for the bloat to occur. > > > To me that's a bit of a different problem than what I was discussing > > here. It also actually doesn't seem that hard - if your caches are > > growing fast, you'll continually get hash-resizing of the > > various. Adding cache-pruning to the resizing code doesn't seem hard, > > and wouldn't add meaningful overhead. > > That's an interesting way to think about it, as well, though I'm not > sure it's quite that simple. If you tie this to cache resizing then > the cache will have to grow up to the newly increased size before > you'll prune it again. That doesn't sound like it will lead to nice > steady-state behavior. Yea, it's not perfect - but if we do pruning both at resize *and* on regular intervals, like once-a-minute as I was suggesting, I don't think it's that bad. The steady state won't be reached within seconds, true, but the negative consequences of only attempting to shrink the cache upon resizing when the cache size is growing fast anyway doesn't seem that large. I don't think we need to be super accurate here, there just needs to be *some* backpressure. I've had cases in the past where just occasionally blasting the cache away would've been good enough. Greetings, Andres Freund
At Fri, 1 Dec 2017 14:12:20 -0800, Andres Freund <andres@anarazel.de> wrote in <20171201221220.z5e6wtlpl264wzik@alap3.anarazel.de> > On 2017-12-01 17:03:28 -0500, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > On 2017-12-01 16:40:23 -0500, Tom Lane wrote: > > >> I have no faith in either of these proposals, because they both assume > > >> that the problem only arises over the course of many minutes. In the > > >> recent complaint about pg_dump causing relcache bloat, it probably does > > >> not take nearly that long for the bloat to occur. > > > > > To me that's a bit of a different problem than what I was discussing > > > here. It also actually doesn't seem that hard - if your caches are > > > growing fast, you'll continually get hash-resizing of the > > > various. Adding cache-pruning to the resizing code doesn't seem hard, > > > and wouldn't add meaningful overhead. > > > > That's an interesting way to think about it, as well, though I'm not > > sure it's quite that simple. If you tie this to cache resizing then > > the cache will have to grow up to the newly increased size before > > you'll prune it again. That doesn't sound like it will lead to nice > > steady-state behavior. > > Yea, it's not perfect - but if we do pruning both at resize *and* on > regular intervals, like once-a-minute as I was suggesting, I don't think > it's that bad. The steady state won't be reached within seconds, true, > but the negative consequences of only attempting to shrink the cache > upon resizing when the cache size is growing fast anyway doesn't seem > that large. > > I don't think we need to be super accurate here, there just needs to be > *some* backpressure. > > I've had cases in the past where just occasionally blasting the cache > away would've been good enough. Thank you very much for the valuable suggestions. 
I still would like to solve this problem, and the a-counter-freely-running-in-minute(or several seconds)-resolution and pruning-too-long-unaccessed-entries-on-resizing approaches seem to me to work well enough for at least several known bloat cases. This still has the defect of not working for very quick bloating. I'll try thinking about the remaining issue.

If no one has an immediate objection to this direction, I'll come up with an implementation.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you very much for the valuable suggestions. I still would > like to solve this problem and the > a-counter-freely-running-in-minute(or several seconds)-resolution > and pruning-too-long-unaccessed-entries-on-resizing seems to me > to work enough for at least several known bloat cases. This still > has a defect that this is not workable for a very quick > bloating. I'll try thinking about the remaining issue. I'm not sure we should regard very quick bloating as a problem in need of solving. Doesn't that just mean we need the cache to be bigger, at least temporarily? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-12-16 22:25:48 -0500, Robert Haas wrote:
> On Wed, Dec 13, 2017 at 11:20 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Thank you very much for the valuable suggestions. I still would
> > like to solve this problem and the
> > a-counter-freely-running-in-minute(or several seconds)-resolution
> > and pruning-too-long-unaccessed-entries-on-resizing seems to me
> > to work enough for at least several known bloat cases. This still
> > has a defect that this is not workable for a very quick
> > bloating. I'll try thinking about the remaining issue.
>
> I'm not sure we should regard very quick bloating as a problem in need
> of solving. Doesn't that just mean we need the cache to be bigger, at
> least temporarily?

Leaving that aside, is that actually not, at least to a good degree, solved by that proposal? By bumping the generation on hash resize, we have recency information we can take into account.

Greetings,

Andres Freund
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote: >> I'm not sure we should regard very quick bloating as a problem in need >> of solving. Doesn't that just mean we need the cache to be bigger, at >> least temporarily? > > Leaving that aside, is that actually not at least to a good degree, > solved by that problem? By bumping the generation on hash resize, we > have recency information we can take into account. I agree that we can do it. I'm just not totally sure it's a good idea. I'm also not totally sure it's a bad idea, either. That's why I asked the question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017-12-17 19:23:45 -0500, Robert Haas wrote:
> On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I'm not sure we should regard very quick bloating as a problem in need
> >> of solving. Doesn't that just mean we need the cache to be bigger, at
> >> least temporarily?
> >
> > Leaving that aside, is that actually not at least to a good degree,
> > solved by that problem? By bumping the generation on hash resize, we
> > have recency information we can take into account.
>
> I agree that we can do it. I'm just not totally sure it's a good
> idea. I'm also not totally sure it's a bad idea, either. That's why
> I asked the question.

I'm not 100% convinced either - but I also don't think it matters all that terribly much. As long as the overall hash hit rate is decent, minor increases in the absolute number of misses don't really matter that much for syscache imo. I'd personally go for something like:

1) When about to resize, check whether there are entries of generation -2 around.

Don't resize if more than 15% of entries could be freed. Also, stop reclaiming at that threshold, to avoid unnecessarily purging cache entries.

Using two generations allows a bit more time for cache entries to be marked as fresh before the next resize.

2) While resizing, increment the generation count by one.

3) Once a minute, increment the generation count by one.

The one thing I don't quite have a good handle on is how much cache reclamation, if any, to do at 3). We don't really want to throw away all the caches just because a connection has been idle for a few minutes; in a connection pool that can happen occasionally. I think I'd for now *not* do any reclamation except at resize boundaries.

Greetings,

Andres Freund
On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote:
> I'm not 100% convinced either - but I also don't think it matters all
> that terribly much. As long as the overall hash hit rate is decent,
> minor increases in the absolute number of misses don't really matter
> that much for syscache imo. I'd personally go for something like:
>
> 1) When about to resize, check if there's entries of a generation -2
> around.
>
> Don't resize if more than 15% of entries could be freed. Also, stop
> reclaiming at that threshold, to avoid unnecessary purging cache
> entries.
>
> Using two generations allows a bit more time for cache entries to
> marked as fresh before resizing next.
>
> 2) While resizing increment generation count by one.
>
> 3) Once a minute, increment generation count by one.
>
> The one thing I'm not quite have a good handle upon is how much, and if
> any, cache reclamation to do at 3). We don't really want to throw away
> all the caches just because a connection has been idle for a few
> minutes, in a connection pool that can happen occasionally. I think I'd
> for now *not* do any reclamation except at resize boundaries.

My starting inclination was almost the opposite. I think that you might be right that a minute or two of idle time isn't sufficient reason to flush our local cache, but I'd be inclined to fix that by incrementing the generation count every 10 minutes or so rather than every minute, and still flush things more than 1 generation old. The reason for that is that I think we should ensure that the system doesn't sit there idle forever with a giant cache. If it's not using those cache entries, I'd rather have it discard them and rebuild the cache when it becomes active again.

Now, I also see your point about trying to clean up before resizing.
That does seem like a good idea, although we have to be careful not to be too eager to clean up there, or we'll just end up artificially limiting the cache size when it's unwise to do so. But I guess that's what you meant by "Also, stop reclaiming at that threshold, to avoid unnecessary purging cache entries." I think the idea you are proposing is that:

1. The first time we are due to expand the hash table, we check whether we can forestall that expansion by doing a cleanup; if so, we do that instead.

2. After that, we just expand.

That seems like a fairly good idea, although it might be a better idea to allow cleanup if enough time has passed. If we hit the expansion threshold twice an hour apart, there's no reason not to try cleanup again.

Generally, the way I'm viewing this is that a syscache entry means paying memory to save CPU time. Each 8kB of memory we use to store system cache entries is one less block we have for the OS page cache to hold onto our data blocks. If we had an oracle (the kind from Delphi, not Redwood City) that told us with perfect accuracy when to discard syscache entries, it would throw away syscache entries whenever the marginal execution-time performance we could buy from another 8kB in the page cache is greater than the marginal execution-time performance we could buy from those syscache entries. In reality, it's hard to know which of those things is of greater value. If the system isn't meaningfully memory-constrained, we ought to just always hang onto the syscache entries, as we do today, but it's hard to know that. I think the place where this really becomes a problem is on systems with hundreds of connections + thousands of tables + connection pooling; without some back-pressure, every backend eventually caches everything, putting the system under severe memory pressure for basically no performance gain.
Each new use of the connection is probably for a limited set of tables, and only those tables really need syscache entries; holding onto things used long in the past doesn't save enough to justify the memory used.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
At Mon, 18 Dec 2017 12:14:24 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaWLBzUasvVs-q=dfBr3pLWSUCQnbqLk-MT7iX4eyrinA@mail.gmail.com> > On Mon, Dec 18, 2017 at 11:46 AM, Andres Freund <andres@anarazel.de> wrote: > > I'm not 100% convinced either - but I also don't think it matters all > > that terribly much. As long as the overall hash hit rate is decent, > > minor increases in the absolute number of misses don't really matter > > that much for syscache imo. I'd personally go for something like: > > > > 1) When about to resize, check if there's entries of a generation -2 > > around. > > > > Don't resize if more than 15% of entries could be freed. Also, stop > > reclaiming at that threshold, to avoid unnecessary purging cache > > entries. > > > > Using two generations allows a bit more time for cache entries to > > marked as fresh before resizing next. > > > > 2) While resizing increment generation count by one. > > > > 3) Once a minute, increment generation count by one. > > > > > > The one thing I'm not quite have a good handle upon is how much, and if > > any, cache reclamation to do at 3). We don't really want to throw away > > all the caches just because a connection has been idle for a few > > minutes, in a connection pool that can happen occasionally. I think I'd > > for now *not* do any reclamation except at resize boundaries. > > My starting inclination was almost the opposite. I think that you > might be right that a minute or two of idle time isn't sufficient > reason to flush our local cache, but I'd be inclined to fix that by > incrementing the generation count every 10 minutes or so rather than > every minute, and still flush things more then 1 generation old. The > reason for that is that I think we should ensure that the system > doesn't sit there idle forever with a giant cache. If it's not using > those cache entries, I'd rather have it discard them and rebuild the > cache when it becomes active again. 
I see three kinds of syscache entries.

A. An entry for an actually existing object. This is literally a syscache entry. This kind of entry does not need to be removed, but can be removed after it has gone unaccessed for a certain period of time.

B. An entry for an object which once existed but no longer does. This can be removed any time after the removal of the object, and is a main cause of the stats bloat or relcache bloat that motivated this thread. We can tell whether entries of this kind are removable using the cache invalidation mechanism (the patch upthread). We can queue the oids that identify the entries to remove, then actually remove them at the next resize. (This also could be another cause of bloat, so we could forcibly flush a hash when the oid list becomes longer than some threshold.)

C. An entry for an object that simply does not exist. I'm not sure how we should treat this, since the necessity of an entry of this kind depends purely on whether it will be accessed again. But we could apply the same assumption as for A.

> Now, I also see that your point about trying to clean up before
> resizing. That does seem like a good idea, although we have to be
> careful not to be too eager to clean up there, or we'll just result in
> artificially limiting the cache size when it's unwise to do so. But I
> guess that's what you meant by "Also, stop reclaiming at that
> threshold, to avoid unnecessary purging cache entries." I think the
> idea you are proposing is that:
>
> 1. The first time we are due to expand the hash table, we check
> whether we can forestall that expansion by doing a cleanup; if so, we
> do that instead.
>
> 2. After that, we just expand.
>
> That seems like a fairly good idea, although it might be a better idea
> to allow cleanup if enough time has passed. If we hit the expansion
> threshold twice an hour apart, there's no reason not to try cleanup
> again.
A session that intermittently executes queries, each running in a very short time, could be considered an example workload where cleanup under such criteria is unwelcome. But the syscache won't bloat in that case. > Generally, the way I'm viewing this is that a syscache entry means > paying memory to save CPU time. Each 8kB of memory we use to store > system cache entries is one less block we have for the OS page cache > to hold onto our data blocks. If we had an oracle (the kind from Sure > Delphi, not Redwood City) that told us with perfect accuracy when to > discard syscache entries, it would throw away syscache entries Except for case B above. The logic seems somewhat alien to the time-based cleanup, but it can be a measure against quick bloat of some syscaches. > whenever the marginal execution-time performance we could buy from > another 8kB in the page cache is greater than the marginal > execution-time performance we could buy from those syscache entries. > In reality, it's hard to know which of those things is of greater > value. If the system isn't meaningfully memory-constrained, we ought > to just always hang onto the syscache entries, as we do today, but > it's hard to know that. I think the place where this really becomes a > problem is on system with hundreds of connections + thousands of > tables + connection pooling; without some back-pressure, every backend > eventually caches everything, putting the system under severe memory > pressure for basically no performance gain. Each new use of the > connection is probably for a limited set of tables, and only those > tables really need syscache entries; holding onto things used long in the > past doesn't save enough to justify the memory used. Agreed. The following is the whole picture of the measures against syscache bloat, taking "quick bloat" into account. (I still think it is wanted in some situations.) 1. 
If any object is removed that makes some syscache entries stale, queue its OID into, for example, a recently_removed_relations OID hash (this cannot be checked without scanning a whole hash, hence the queue). 2. If the number of OID-hash entries reaches 1000 or 10000 (mmm, quite arbitrary..), immediately clean up the syscaches that accept/need removed-reloid cleanup. (The OID hash might be needed separately for each target cache to avoid redundant scans, or to avoid a kind of generation management in the OID hash.) 3. > 1. The first time we are due to expand the hash table, we check > whether we can forestall that expansion by doing a cleanup; if so, we > do that instead. And if there's any entry in the removed-reloid hash, it is considered during cleanup. 4. > 2. After that, we just expand. > > That seems like a fairly good idea, although it might be a better idea > to allow cleanup if enough time has passed. If we hit the expansion > threshold twice an hour apart, there's no reason not to try cleanup > again. 1 + 2 and 3 + 4 can be implemented as separate patches, and I'll do the latter first. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I see three kinds of syscache entries. > > A. An entry for an actually existing object. > B. An entry for an object which once existed but no longer does. > C. An entry for an object that never existed. I'm not convinced that it's useful to divide things up this way. Regardless of whether the syscache entry is a positive entry, a negative entry for a dropped object, or a negative entry for an object that never existed in the first place, it's valuable if it's likely to get used again and worthless if not. Positive entries may get used repeatedly, or not; negative entries may get used repeatedly, or not. >> Generally, the way I'm viewing this is that a syscache entry means >> paying memory to save CPU time. Each 8kB of memory we use to store >> system cache entries is one less block we have for the OS page cache >> to hold onto our data blocks. If we had an oracle (the kind from > > Sure > >> Delphi, not Redwood City) that told us with perfect accuracy when to >> discard syscache entries, it would throw away syscache entries > > Except for case B above. The logic seems somewhat alien to > the time-based cleanup but this can be a measure against quick > bloat of some syscaches. I guess I still don't see why B is different. If somebody sits there and runs queries against non-existent table names at top speed, maybe they'll query the same non-existent table entries more than once, in which case keeping the negative entries for the non-existent table names around until they stop doing it may improve performance. If they are sitting there and running queries against randomly-generated non-existent table names at top speed, then they'll generate a lot of catcache bloat, but that's not really any different from a database with a large number of tables that DO exist which are queried at random. 
Workloads that access a lot of objects, whether those objects exist or not, are going to use up a lot of cache entries, and I guess that just seems OK to me. > Agreed. The following is the whole picture of the measures against > syscache bloat, taking "quick bloat" into account. (I still think it is > wanted in some situations.) > > 1. If any object is removed that makes some syscache entries > stale, queue its OID into, for example, a recently_removed_relations > OID hash (this cannot be checked without scanning a whole hash). If we just let some sort of cleanup process that generally blows away rarely-used entries get rid of those entries too, then it should handle this case, too, because the cache entries pertaining to removed relations (or schemas) probably won't get used after that (and if they do, we should keep them). So I don't see that there is a need for this, and it drew objections upthread because of the cost of scanning the whole hash table. Batching relations together might help, but it doesn't really seem worth trying to sort out the problems with this idea when we can do something better and more general. > 2. If the number of OID-hash entries reaches 1000 or 10000 > (mmm, quite arbitrary..), immediately clean up the syscaches that > accept/need removed-reloid cleanup. (The OID hash might be > needed separately for each target cache to avoid redundant > scans, or to avoid a kind of generation management in the OID > hash.) That is bound to draw a strong negative response from Tom, and for good reason. If the number of relations in the working set is 1001 and your cleanup threshold is 1000, cleanups will happen constantly and performance will be poor. This is exactly why, as I said in the second email on this thread, the limit on the size of the relcache was removed. >> 1. The first time we are due to expand the hash table, we check >> whether we can forestall that expansion by doing a cleanup; if so, we >> do that instead. 
> > And if there's any entry in the removed-reloid hash it is > considered while cleanup. As I say, I don't think there's any need for a removed-reloid hash. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> I see three kinds of syscache entries. >> >> A. An entry for an actually existing object. >> B. An entry for an object which once existed but no longer does. >> C. An entry for an object that never existed. > I'm not convinced that it's useful to divide things up this way. Actually, I don't believe that case B exists at all; such an entry should get blown away by syscache invalidation when we commit the DROP command. If one were to stick around, we'd risk false positive lookups later. > I guess I still don't see why B is different. If somebody sits there > and runs queries against non-existent table names at top speed, maybe > they'll query the same non-existent table entries more than once, in > which case keeping the negative entries for the non-existent table > names around until they stop doing it may improve performance. FWIW, my recollection is that the reason for negative cache entries is that there are some very common patterns where we probe for object names (not just table names, either) that aren't there, typically as a side effect of walking through the search_path looking for a match to an unqualified object name. Those cache entries aren't going to get any less useful than the positive entry for the ultimately-found object. So from a lifespan point of view I'm not very sure that it's worth distinguishing cases A and C. It's conceivable that we could rewrite all the lookup algorithms so that they didn't require negative cache entries to have good performance ... but I doubt that that's easy to do. regards, tom lane
At Tue, 19 Dec 2017 13:14:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <748.1513707249@sss.pgh.pa.us> > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Dec 19, 2017 at 3:31 AM, Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > >> I see three kinds of syscache entries. > >> > >> A. An entry for an actually existing object. > >> B. An entry for an object which once existed but no longer does. > >> C. An entry for an object that never existed. > > > I'm not convinced that it's useful to divide things up this way. > > Actually, I don't believe that case B exists at all; such an entry > should get blown away by syscache invalidation when we commit the > DROP command. If one were to stick around, we'd risk false positive > lookups later. As I have shown upthread, access to a temporary table (*1) leaves several STATRELATTINH entries after DROPing, and they don't have a chance to be deleted. SELECTing a nonexistent table in a schema (*2) also leaves a RELNAMENSP entry after DROPing the schema. I'm not sure that the latter happens so frequently, but the former happens rather frequently and quickly bloats the syscache once it happens. However, no false positives can happen since such entries cannot be reached without their parent objects; on the other hand they have no chance to be deleted. *1: begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on commit drop; insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit; *2: create schema foo; select * from foo.invalid; drop schema foo; > > I guess I still don't see why B is different. If somebody sits there > > and runs queries against non-existent table names at top speed, maybe > > they'll query the same non-existent table entries more than once, in > > which case keeping the negative entries for the non-existent table > > names around until they stop doing it may improve performance. 
> > FWIW, my recollection is that the reason for negative cache entries > is that there are some very common patterns where we probe for object > names (not just table names, either) that aren't there, typically as > a side effect of walking through the search_path looking for a match > to an unqualified object name. Those cache entries aren't going to > get any less useful than the positive entry for the ultimately-found > object. So from a lifespan point of view I'm not very sure that it's > worth distinguishing cases A and C. Agreed. > It's conceivable that we could rewrite all the lookup algorithms > so that they didn't require negative cache entries to have good > performance ... but I doubt that that's easy to do. That sounds to me the same as improving the performance of systable scans to the level of the local hash. A lockless systable (index) might work (if possible)? Anyway, I think we have reached a consensus that time-tick-based expiration is promising. So I'll work on that as the first step. Thanks! -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 22 Dec 2017 13:47:16 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171222.134716.88479707.horiguchi.kyotaro@lab.ntt.co.jp> > Anyway, I think we have reached a consensus that > time-tick-based expiration is promising. So I'll work on that > as the first step. So this is the patch. It got simpler. # I have come to think that the second step is not needed. I'm not sure that no syscache access happens outside a statement, but the operations that lead to the bloat seem to be performed while processing a statement, so the statement timestamp seems sufficient as the aging clock. At first I tried the simple strategy of removing entries that have been left alone for 30 minutes or more, but I still want to alleviate the quick bloat (by non-reused entries), so I also introduced a clock-sweep-like aging mechanism. An entry is created with naccess = 0, which is then incremented up to 2 each time the entry is accessed. The removal side decrements naccess for entries older than 600 seconds, and actually removes an entry once it reaches 0. Entries that are created and never used go away in 600 seconds, and entries that have been accessed several times get 1800 seconds' grace after the last access. We could shrink the bucket array as well, but I didn't, since it is not so large and is prone to grow back to the same size shortly. 
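Distilled, the aging rules just described behave like this. The following is a standalone sketch with invented names, not the patch itself; among other things, the real code also checks CatCList reference counts before removing an entry.

```c
/*
 * Sketch of the clock-sweep-like aging described above: naccess is
 * bumped to at most 2 on each access, and each sweep over an entry
 * older than 600 seconds decrements naccess, removing the entry once
 * it reaches 0.  Times are plain seconds for simplicity.
 */
#include <assert.h>

#define PRUNE_MIN_AGE 600       /* seconds */

typedef struct
{
    int  naccess;               /* 0..2, bumped on access */
    long lastaccess;            /* time of last access */
    int  live;                  /* still present in the cache? */
} CEntry;

/* Mark an entry as accessed at time 'now'. */
void
touch(CEntry *e, long now)
{
    if (e->naccess < 2)
        e->naccess++;
    e->lastaccess = now;
}

/* One sweep step over a single entry at time 'now'. */
void
age_entry(CEntry *e, long now)
{
    if (!e->live || now - e->lastaccess <= PRUNE_MIN_AGE)
        return;                 /* recently used: leave alone */
    if (e->naccess > 0)
        e->naccess--;           /* grant another round of grace */
    else
        e->live = 0;            /* unused long enough: remove */
}
```

An entry that is created and never touched dies at the first sweep more than 600 seconds later; an entry touched twice needs three such sweeps, i.e. roughly 1800 seconds after its last access, matching the figures in the message above.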
regards, -- Kyotaro Horiguchi NTT Open Source Software Center *** a/src/backend/access/transam/xact.c --- b/src/backend/access/transam/xact.c *************** *** 733,738 **** void --- 733,741 ---- SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this time stamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); } /* *** a/src/backend/utils/cache/catcache.c --- b/src/backend/utils/cache/catcache.c *************** *** 74,79 **** --- 74,82 ---- /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; + /* Timestamp used for any operation on caches. */ + TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, *************** *** 866,875 **** InitCatCache(int id, --- 869,969 ---- */ MemoryContextSwitchTo(oldcxt); + /* initilize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } /* + * Remove entries that haven't been accessed for a certain time. + * + * Sometimes catcache entries are left unremoved for several reasons. We + * cannot allow them to eat up the usable memory and still it is better to + * remove entries that are no longer accessed from the perspective of memory + * performance ratio. Unfortunately we cannot predict that but we can assume + * that entries that are not accessed for long time no longer contribute to + * performance. + */ + static bool + CatCacheCleanupOldEntries(CatCache *cp) + { + int i; + int nremoved = 0; + #ifdef CATCACHE_STATS + int ntotal = 0; + int tm[] = {30, 60, 600, 1200, 1800, 0}; + int cn[6] = {0, 0, 0, 0, 0}; + int cage[3] = {0, 0, 0}; + #endif + + /* Move all entries from old hash table to new. 
*/ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long s; + int us; + + + TimestampDifference(ct->lastaccess, catcacheclock, &s, &us); + + #ifdef CATCACHE_STATS + { + int j; + + ntotal++; + for (j = 0 ; tm[j] != 0 && s > tm[j] ; j++); + if (tm[j] == 0) j--; + cn[j]++; + } + #endif + + /* + * Remove entries older than 600 seconds but not recently used. + * Entries that are not accessed after creation are removed in 600 + * seconds, and that has been used several times are removed after + * 30 minumtes ignorance. We don't try shrink buckets since they + * are not the major part of syscache bloat and they are expected + * to be filled shortly again. + */ + if (s > 600) + { + #ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + cage[ct->naccess]++; + #endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + + #ifdef CATCACHE_STATS + ereport(DEBUG2, + (errmsg ("removed %d/%d, age(-30s:%d, -60s:%d, -600s:%d, -1200s:%d, -1800:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + cn[0], cn[1], cn[2], cn[3], cn[4], + cage[0], cage[1], cage[2]), + errhidestmt(true))); + #endif + + return nremoved > 0; + } + + /* * Enlarge a catcache, doubling the number of buckets. */ static void *************** *** 1282,1287 **** SearchCatCacheInternal(CatCache *cache, --- 1376,1389 ---- */ dlist_move_head(bucket, &ct->cache_elem); + + /* + * Update the last access time of this entry + */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
*************** *** 1901,1906 **** CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, --- 2003,2010 ---- ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); *************** *** 1911,1917 **** CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * If the hash table has become too full, enlarge the buckets array. Quite * arbitrarily, we enlarge when fill factor > 2. */ ! if (cache->cc_ntup > cache->cc_nbuckets * 2) RehashCatCache(cache); return ct; --- 2015,2022 ---- * If the hash table has become too full, enlarge the buckets array. Quite * arbitrarily, we enlarge when fill factor > 2. */ ! if (cache->cc_ntup > cache->cc_nbuckets * 2 && ! !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; *** a/src/include/utils/catcache.h --- b/src/include/utils/catcache.h *************** *** 119,124 **** typedef struct catctup --- 119,126 ---- bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry */ + TimestampTz lastaccess; /* approx. TS of the last access/modification */ /* * The tuple may also be a member of at most one CatCList. (If a single *************** *** 189,194 **** typedef struct catcacheheader --- 191,203 ---- /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; + extern PGDLLIMPORT TimestampTz catcacheclock; + static inline void + SetCatCacheClock(TimestampTz ts) + { + catcacheclock = ts; + } + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
On 2017-12-26 18:19:16 +0900, Kyotaro HORIGUCHI wrote: > --- a/src/backend/access/transam/xact.c > +++ b/src/backend/access/transam/xact.c > @@ -733,6 +733,9 @@ void > SetCurrentStatementStartTimestamp(void) > { > stmtStartTimestamp = GetCurrentTimestamp(); > + > + /* Set this time stamp as aproximated current time */ > + SetCatCacheClock(stmtStartTimestamp); > } Hm. > + * Remove entries that haven't been accessed for a certain time. > + * > + * Sometimes catcache entries are left unremoved for several reasons. I'm unconvinced that that's ok for positive entries, entirely regardless of this patch. > We > + * cannot allow them to eat up the usable memory and still it is better to > + * remove entries that are no longer accessed from the perspective of memory > + * performance ratio. Unfortunately we cannot predict that but we can assume > + * that entries that are not accessed for long time no longer contribute to > + * performance. > + */ This needs polish. > +static bool > +CatCacheCleanupOldEntries(CatCache *cp) > +{ > + int i; > + int nremoved = 0; > +#ifdef CATCACHE_STATS > + int ntotal = 0; > + int tm[] = {30, 60, 600, 1200, 1800, 0}; > + int cn[6] = {0, 0, 0, 0, 0}; > + int cage[3] = {0, 0, 0}; > +#endif This doesn't look nice: the names aren't descriptive enough to be self-evident, and there are no comments saying what these random arrays mean. And some specify a length (and have a differing number of elements!) and others don't. > + /* Move all entries from old hash table to new. */ > + for (i = 0; i < cp->cc_nbuckets; i++) > + { > + dlist_mutable_iter iter; > + > + dlist_foreach_modify(iter, &cp->cc_bucket[i]) > + { > + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); > + long s; > + int us; > + > + > + TimestampDifference(ct->lastaccess, catcacheclock, &s, &us); > + > +#ifdef CATCACHE_STATS > + { > + int j; > + > + ntotal++; > + for (j = 0 ; tm[j] != 0 && s > tm[j] ; j++); > + if (tm[j] == 0) j--; > + cn[j]++; > + } > +#endif What? 
> + /* > + * Remove entries older than 600 seconds but not recently used. > + * Entries that are not accessed after creation are removed in 600 > + * seconds, and that has been used several times are removed after > + * 30 minumtes ignorance. We don't try shrink buckets since they > + * are not the major part of syscache bloat and they are expected > + * to be filled shortly again. > + */ > + if (s > 600) > + { So this is hardcoded, without any sort of cache pressure logic? Doesn't that mean we'll often *severely* degrade performance if a backend is idle for a while? Greetings, Andres Freund
On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote: > So this is hardcoded, without any sort of cache pressure logic? Doesn't > that mean we'll often *severely* degrade performance if a backend is > idle for a while? Well, it is true that if we flush cache entries that haven't been used in a long time, a backend that is idle for a long time might be a bit slow when (and if) it eventually becomes non-idle, because it may have to reload some of those flushed entries. On the other hand, a backend that holds onto a large number of cache entries that it's not using for tens of minutes at a time degrades the performance of the whole system unless, of course, you're running on a machine that is under no memory pressure at all. I don't understand why people keep acting as if holding onto cache entries regardless of how infrequently they're being used is an unalloyed good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2018-03-01 14:24:56 -0500, Robert Haas wrote: > On Thu, Mar 1, 2018 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote: > > So this is hardcoded, without any sort of cache pressure logic? Doesn't > > that mean we'll often *severely* degrade performance if a backend is > > idle for a while? > > Well, it is true that if we flush cache entries that haven't been used > in a long time, a backend that is idle for a long time might be a bit > slow when (and if) it eventually becomes non-idle, because it may have > to reload some of those flushed entries. Right. Which might be very painful latency wise. And with poolers it's pretty easy to get into situations like that, without the app influencing it. > On the other hand, a backend that holds onto a large number of cache > entries that it's not using for tens of minutes at a time degrades the > performance of the whole system unless, of course, you're running on a > machine that is under no memory pressure at all. But it's *extremely* common to have no memory pressure these days. The inverse definitely also exists. > I don't understand why people keep acting as if holding onto cache > entries regardless of how infrequently they're being used is an > unalloyed good. Huh? I'm definitely not arguing for that? I think we want a feature like this, I just don't think the logic when to prune is quite sophisticated enough? Greetings, Andres Freund
On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: > Right. Which might be very painful latency wise. And with poolers it's > pretty easy to get into situations like that, without the app > influencing it. Really? I'm not sure I believe that. You're talking perhaps a few milliseconds - maybe less - of additional latency on a connection that's been idle for many minutes. You need to have a workload that involves leaving connections idle for very long periods but has extremely tight latency requirements when it does finally send a query. I suppose such workloads exist, but I would not think them common. Anyway, I don't mind making the exact timeout a GUC (with 0 disabling the feature altogether) if that addresses your concern, but in general I think that it's reasonable to accept that a connection that's been idle for a long time may have a little bit more latency than usual when you start using it again. That could happen for other reasons anyway -- e.g. the cache could have been flushed because of concurrent DDL on the objects you were accessing, by a syscache reset caused by a flood of temp objects being created, or by the operating system deciding to page out some of your data, or by your data getting evicted from the CPU caches, or by being scheduled onto a NUMA node different than the one that contains its data. Operating systems have been optimizing for the performance of relatively active processes over ones that have been idle for a long time since the 1960s or earlier, and I don't know of any reason why PostgreSQL shouldn't do the same. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-03-01 14:49:26 -0500, Robert Haas wrote: > On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: > > Right. Which might be very painful latency wise. And with poolers it's > > pretty easy to get into situations like that, without the app > > influencing it. > > Really? I'm not sure I believe that. You're talking perhaps a few > milliseconds - maybe less - of additional latency on a connection > that's been idle for many minutes. I've seen latency increases in second+ ranges due to empty cat/sys/rel caches. And the connection doesn't have to be idle, it might just have been active for a different application doing different things, thus accessing different cache entries. With a pooler you can trivially end up switch connections occasionally between different [parts of] applications, and you don't want performance to suck after each time. You also don't want to use up all memory, I entirely agree on that. > Anyway, I don't mind making the exact timeout a GUC (with 0 disabling > the feature altogether) if that addresses your concern, but in general > I think that it's reasonable to accept that a connection that's been > idle for a long time may have a little bit more latency than usual > when you start using it again. I don't think that'd quite address my concern. I just don't think that the granularity (drop all entries older than xxx sec at the next resize) is right. For one I don't want to drop stuff if the cache size isn't a problem for the current memory budget. For another, I'm not convinced that dropping entries from the current "generation" at resize won't end up throwing away too much. If we'd a guc 'syscache_memory_target' and we'd only start pruning if above it, I'd be much happier. Greetings, Andres Freund
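The suggestion above amounts to gating the age test behind a size test. A minimal sketch of the combined predicate follows; the variable names are hypothetical here (loosely following the GUC Andres proposes, which is not part of the patch at this point in the thread), and how to cheaply track the cache's byte size is left open.

```c
/*
 * Sketch of the pruning condition suggested above: never prune while
 * the cache is within its memory budget; beyond the budget, fall back
 * to the age-based test.  Names are illustrative, not actual GUCs.
 */
#include <assert.h>

long syscache_memory_target = 0;    /* bytes; 0 = always consider pruning */
long syscache_prune_min_age = 600;  /* seconds */

/* Should an entry of the given age be pruned, given the cache's size? */
int
should_prune_entry(long cache_bytes, long entry_age_secs)
{
    if (cache_bytes <= syscache_memory_target)
        return 0;               /* within budget: keep everything */
    return entry_age_secs > syscache_prune_min_age;
}
```

Under this shape, an idle backend whose cache stays under the target keeps all its entries regardless of age, addressing the latency concern, while a bloated cache still sheds entries that have gone unused for longer than the minimum age.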
On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote: > On 2018-03-01 14:49:26 -0500, Robert Haas wrote: >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: >> > Right. Which might be very painful latency wise. And with poolers it's >> > pretty easy to get into situations like that, without the app >> > influencing it. >> >> Really? I'm not sure I believe that. You're talking perhaps a few >> milliseconds - maybe less - of additional latency on a connection >> that's been idle for many minutes. > > I've seen latency increases in second+ ranges due to empty cat/sys/rel > caches. How is that even possible unless the system is grossly overloaded? >> Anyway, I don't mind making the exact timeout a GUC (with 0 disabling >> the feature altogether) if that addresses your concern, but in general >> I think that it's reasonable to accept that a connection that's been >> idle for a long time may have a little bit more latency than usual >> when you start using it again. > > I don't think that'd quite address my concern. I just don't think that > the granularity (drop all entries older than xxx sec at the next resize) > is right. For one I don't want to drop stuff if the cache size isn't a > problem for the current memory budget. For another, I'm not convinced > that dropping entries from the current "generation" at resize won't end > up throwing away too much. I think that a fixed memory budget for the syscache is an idea that was tried many years ago and basically failed, because it's very easy to end up with terrible eviction patterns -- e.g. if you are accessing 11 relations in round-robin fashion with a 10-relation cache, your cache nets you a 0% hit rate but takes a lot more maintenance than having no cache at all. The time-based approach lets the cache grow with no fixed upper limit without allowing unused entries to stick around forever. 
> If we'd a guc 'syscache_memory_target' and we'd only start pruning if > above it, I'd be much happier. It does seem reasonable to skip pruning altogether if the cache is below some threshold size. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
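The round-robin pathology Robert describes is easy to reproduce: with strict LRU eviction and a working set just one entry larger than the cache, the hit rate is exactly zero. The following is a standalone illustration (not PostgreSQL code; the array-based LRU is invented for brevity):

```c
/*
 * Demonstration of the eviction pathology described above: an LRU
 * cache of CAP entries, accessed round-robin over CAP + 1 distinct
 * keys, never hits, despite doing more work than having no cache.
 */
#include <string.h>

#define CAP 10

int lru[CAP];                   /* lru[0] is most recently used */
int ncached = 0;

/* Access one key; returns 1 on a cache hit, 0 on a miss. */
int
access_key(int key)
{
    int i;

    for (i = 0; i < ncached; i++)
        if (lru[i] == key)
        {
            memmove(&lru[1], &lru[0], i * sizeof(int));
            lru[0] = key;       /* hit: move to front */
            return 1;
        }
    if (ncached == CAP)
        ncached--;              /* full: evict least recently used */
    memmove(&lru[1], &lru[0], ncached * sizeof(int));
    lru[0] = key;               /* miss: insert at front */
    ncached++;
    return 0;
}

/* Hit rate (percent) of round-robin access over nkeys distinct keys. */
int
hit_rate_pct(int nkeys, int naccesses)
{
    int a, hits = 0;

    ncached = 0;                /* start with an empty cache */
    for (a = 0; a < naccesses; a++)
        hits += access_key(a % nkeys);
    return hits * 100 / naccesses;
}
```

Eleven keys against a ten-entry cache yield a 0% hit rate, while ten keys yield hits on every access after the compulsory misses; this is the cliff that makes a fixed entry-count cap so dangerous.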
Hi, On 2018-03-01 15:19:26 -0500, Robert Haas wrote: > On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote: > > On 2018-03-01 14:49:26 -0500, Robert Haas wrote: > >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: > >> > Right. Which might be very painful latency wise. And with poolers it's > >> > pretty easy to get into situations like that, without the app > >> > influencing it. > >> > >> Really? I'm not sure I believe that. You're talking perhaps a few > >> milliseconds - maybe less - of additional latency on a connection > >> that's been idle for many minutes. > > > > I've seen latency increases in second+ ranges due to empty cat/sys/rel > > caches. > > How is that even possible unless the system is grossly overloaded? You just need to have catalog contents out of cache and statements touching a few relations, functions, etc. Indexscan + heap fetch latencies do add up quite quickly if done sequentially. > > I don't think that'd quite address my concern. I just don't think that > > the granularity (drop all entries older than xxx sec at the next resize) > > is right. For one I don't want to drop stuff if the cache size isn't a > > problem for the current memory budget. For another, I'm not convinced > > that dropping entries from the current "generation" at resize won't end > > up throwing away too much. > > I think that a fixed memory budget for the syscache is an idea that > was tried many years ago and basically failed, because it's very easy > to end up with terrible eviction patterns -- e.g. if you are accessing > 11 relations in round-robin fashion with a 10-relation cache, your > cache nets you a 0% hit rate but takes a lot more maintenance than > having no cache at all. The time-based approach lets the cache grow > with no fixed upper limit without allowing unused entries to stick > around forever. 
I definitely think we want a time based component to this, I just want to not prune at all if we're below a certain size. > > If we'd a guc 'syscache_memory_target' and we'd only start pruning if > > above it, I'd be much happier. > > It does seem reasonable to skip pruning altogether if the cache is > below some threshold size. Cool. There might be some issues making that check performant enough, but I don't have a good intuition on it. Greetings, Andres Freund
Hello. Thank you for the discussion, and sorry for being late to come. At Thu, 1 Mar 2018 12:26:30 -0800, Andres Freund <andres@anarazel.de> wrote in <20180301202630.2s6untij2x5hpksn@alap3.anarazel.de> > Hi, > > On 2018-03-01 15:19:26 -0500, Robert Haas wrote: > > On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote: > > > On 2018-03-01 14:49:26 -0500, Robert Haas wrote: > > >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: > > >> > Right. Which might be very painful latency wise. And with poolers it's > > >> > pretty easy to get into situations like that, without the app > > >> > influencing it. > > >> > > >> Really? I'm not sure I believe that. You're talking perhaps a few > > >> milliseconds - maybe less - of additional latency on a connection > > >> that's been idle for many minutes. > > > > > > I've seen latency increases in second+ ranges due to empty cat/sys/rel > > > caches. > > > > How is that even possible unless the system is grossly overloaded? > > You just need to have catalog contents out of cache and statements > touching a few relations, functions, etc. Indexscan + heap fetch > latencies do add up quite quickly if done sequentially. > > > > > I don't think that'd quite address my concern. I just don't think that > > > the granularity (drop all entries older than xxx sec at the next resize) > > > is right. For one I don't want to drop stuff if the cache size isn't a > > > problem for the current memory budget. For another, I'm not convinced > > > that dropping entries from the current "generation" at resize won't end > > > up throwing away too much. > > > > I think that a fixed memory budget for the syscache is an idea that > > was tried many years ago and basically failed, because it's very easy > > to end up with terrible eviction patterns -- e.g. 
if you are accessing > > 11 relations in round-robin fashion with a 10-relation cache, your > > cache nets you a 0% hit rate but takes a lot more maintenance than > > having no cache at all. The time-based approach lets the cache grow > > with no fixed upper limit without allowing unused entries to stick > > around forever. > > I definitely think we want a time based component to this, I just want > to not prune at all if we're below a certain size. > > > > > If we'd a guc 'syscache_memory_target' and we'd only start pruning if > > > above it, I'd be much happier. > > > > It does seem reasonable to skip pruning altogether if the cache is > > below some threshold size. > > Cool. There might be some issues making that check performant enough, > but I don't have a good intuition on it. So.. - Now it gets two new GUC variables named syscache_prune_min_age and syscache_memory_target. The former replaces the previous magic number 600 and defaults to the same value. The latter prevents syscache pruning until the cache exceeds that size, and defaults to 0, meaning that pruning is always considered. Documentation for the two variables is also added. - Revised the mysterious comment pointed out for CatcacheCleanupOldEntries, and added some comments. - Renamed the variables for CATCACHE_STATS to be more descriptive, and added some comments for the code. The catcache entries accessed within the current transaction won't be pruned, so theoretically a long transaction can bloat the catcache. But I believe that is quite rare, or at least this covers most other cases. regards, -- Kyotaro Horiguchi NTT Open Source Software Center From d3b73b68ed4ce246a0892ac72ec2eed1a47429f2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 26 Dec 2017 17:43:09 +0900 Subject: [PATCH] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory.
This patch adds consideration of removing entries that haven't been used for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 +++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 158 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 19 ++++ 6 files changed, 238 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 259a2d83b4..fd25669abc 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1556,6 +1556,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a + certain amount of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds after which a + syscache entry is considered for removal. -1 indicates that syscache + pruning is disabled entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>).
The syscache entries that are not + used for that duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index dbaaf8e005..86d76917bb 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,6 +733,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..56d4f10019 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,23 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * Let the name be the same as the guc variable name, not using 'catcache'. + */ +int syscache_memory_target = 0; + +/* GUC variable to define the minimum age in seconds of entries that will be + * considered for eviction. + */ +int syscache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +880,133 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove catcache + * entries that are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * syscache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (syscache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the area for the bucket array is dominant, consider only it.
+ */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < syscache_memory_target * 1024) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the time elapsed from the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > syscache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Remove entries older than syscache_prune_min_age seconds that + * have not been used recently. Entries that are never accessed + * again are removed after that many seconds, while entries that + * have been used several times are removed only after being left + * alone for up to three times that duration. We don't try to + * shrink buckets since this effectively prevents the catcache + * from being enlarged in the long + term.
+ */ + if (entry_age > syscache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG2, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d),hash_size = %lubytes, %d", + nremoved, ntotal, + ageclass[0] * syscache_prune_min_age, nentries[0], + ageclass[1] * syscache_prune_min_age, nentries[1], + ageclass[2] * syscache_prune_min_age, nentries[2], + ageclass[3] * syscache_prune_min_age, nentries[3], + ageclass[4] * syscache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2], + hash_size, syscache_memory_target + ), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1420,14 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + + /* + * Update the last access time of this entry + */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1813,7 +1959,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2051,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2060,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails to remove any entry, enlarge the bucket array instead. Quite + * arbitrarily, we try this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1db7845d5a..a63bc5eb79 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -78,6 +78,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -1971,6 +1972,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Syscache entries are not pruned until the size of syscache exceeds this size."), + GUC_UNIT_KB + }, + &syscache_memory_target, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"syscache_prune_min_age",
PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum time to consider syscache pruning."), + gettext_noop("Syscache entries that have lived for less than this many seconds will not be considered for pruning."), + GUC_UNIT_S + }, + &syscache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 39272925fb..0bda73d080 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -124,6 +124,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#syscache_prune_min_age = 600s # minimum age of syscache entries to keep #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..eb89a9f0d7 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of accesses to this entry */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int syscache_prune_min_age; +extern int syscache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.2
Oops! The previous patch contained garbage printing in debugging output. The attached is the new version without the garbage. In addition, I changed my mind to use DEBUG1 for the debug message since the frequency is quite low. No changes in the following cited previous mail. At Wed, 07 Mar 2018 16:19:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180307.161923.178158050.horiguchi.kyotaro@lab.ntt.co.jp> ================== Hello. Thank you for the discussion, and sorry for being late to come. At Thu, 1 Mar 2018 12:26:30 -0800, Andres Freund <andres@anarazel.de> wrote in <20180301202630.2s6untij2x5hpksn@alap3.anarazel.de> > Hi, > > On 2018-03-01 15:19:26 -0500, Robert Haas wrote: > > On Thu, Mar 1, 2018 at 3:01 PM, Andres Freund <andres@anarazel.de> wrote: > > > On 2018-03-01 14:49:26 -0500, Robert Haas wrote: > > >> On Thu, Mar 1, 2018 at 2:29 PM, Andres Freund <andres@anarazel.de> wrote: > > >> > Right. Which might be very painful latency wise. And with poolers it's > > >> > pretty easy to get into situations like that, without the app > > >> > influencing it. > > >> > > >> Really? I'm not sure I believe that. You're talking perhaps a few > > >> milliseconds - maybe less - of additional latency on a connection > > >> that's been idle for many minutes. > > > > > > I've seen latency increases in second+ ranges due to empty cat/sys/rel > > > caches. > > > > How is that even possible unless the system is grossly overloaded? > > You just need to have catalog contents out of cache and statements > touching a few relations, functions, etc. Indexscan + heap fetch > latencies do add up quite quickly if done sequentially. > > > > > I don't think that'd quite address my concern. I just don't think that > > > the granularity (drop all entries older than xxx sec at the next resize) > > > is right. For one I don't want to drop stuff if the cache size isn't a > > > problem for the current memory budget.
For another, I'm not convinced > > > that dropping entries from the current "generation" at resize won't end > > > up throwing away too much. > > > > I think that a fixed memory budget for the syscache is an idea that > > was tried many years ago and basically failed, because it's very easy > > to end up with terrible eviction patterns -- e.g. if you are accessing > > 11 relations in round-robin fashion with a 10-relation cache, your > > cache nets you a 0% hit rate but takes a lot more maintenance than > > having no cache at all. The time-based approach lets the cache grow > > with no fixed upper limit without allowing unused entries to stick > > around forever. > > I definitely think we want a time based component to this, I just want > to not prune at all if we're below a certain size. > > > > > If we'd a guc 'syscache_memory_target' and we'd only start pruning if > > > above it, I'd be much happier. > > > > It does seem reasonable to skip pruning altogether if the cache is > > below some threshold size. > > Cool. There might be some issues making that check performant enough, > but I don't have a good intuition on it. So.. - Now it gets two new GUC variables named syscache_prune_min_age and syscache_memory_target. The former is the replacement of the previous magic number 600 and defaults to the same number. The latter prevens syscache pruning until exceeding the size and defaults to 0, means that pruning is always considered. Documentation for the two variables are also added. - Revised the pointed mysterious comment for CatcacheCleanupOldEntries and some comments are added. - Fixed the name of the variables for CATCACHE_STATS to be more descriptive, and added some comments for the code. The catcache entries accessed within the current transaction won't be pruned so theoretically a long transaction can bloat catcache. But I believe it is quite rare, or at least this saves the most other cases. 
regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 975e7e82d4eeb7d7b7ecf981141a8924297c46ef Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 26 Dec 2017 17:43:09 +0900 Subject: [PATCH] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 +++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 152 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 19 ++++ 6 files changed, 233 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 259a2d83b4..782b506984 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1556,6 +1556,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a + certain amount of syscache entries with intermittent usage, try + increasing this setting.
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index dbaaf8e005..86d76917bb 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,6 +733,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..e4a9ab8789 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,23 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to cosider entry eviction. + * Let the name be the same with the guc variable name, not using 'catcache'. + */ +int syscache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be cosidered + * to be evicted in seconds. Ditto for the name. 
+ */ +int syscache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +880,130 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initilize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they not accessed for a certain time to prevent catcache from bloating. The + * eviction is performed with the similar algorithm with buffer eviction using + * access counter. Entries that are accessed several times can live longer + * than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purpose */ + int ntotal = 0; + /* + * nth element in nentries stores the number of cache entries that have + * lived unaccessed for corresponding multiple in ageclass of + * syscache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (syscache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the area for bucket array is dominant, consider only it. 
+ */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < syscache_memory_target * 1024) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > syscache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than syscache_prune_min_age + * seconds. Entries that are not accessed after last pruning are + * removed in that seconds, and that has been accessed several + * times are removed after leaving alone for up to three times of + * the duration. We don't try shrink buckets since pruning + * effectively caps catcache expansion in the long term. 
+ */ + if (entry_age > syscache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * syscache_prune_min_age, nentries[0], + ageclass[1] * syscache_prune_min_age, nentries[1], + ageclass[2] * syscache_prune_min_age, nentries[2], + ageclass[3] * syscache_prune_min_age, nentries[3], + ageclass[4] * syscache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. 
Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make a room for the new entry. If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1db7845d5a..33abe04efe 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -78,6 +78,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -1971,6 +1972,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Syscache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &syscache_memory_target, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum duration of an unused syscache entry to remove."), + gettext_noop("Syscache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &syscache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 39272925fb..5a5729a88f 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -124,6 +124,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#syscache_memory_target = 0kB # in kB. zero disables the feature +#syscache_prune_min_age = 600s # -1 disables the feature #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..c3c4d65998 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int syscache_prune_min_age; +extern int syscache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.2
The thing that comes to mind when reading this patch is that some time ago we made fun of other database software, "they are so complicated to configure, they have some magical settings that few people understand how to set". Postgres was so much better because it was simple to set up, no magic crap. But now it becomes apparent that that only was so because Postgres sucked, i.e., we hadn't yet gotten to the point where we *needed* to introduce settings like that. Now we finally are? I have to admit being a little disappointed about that outcome. I wonder if this is just because we refuse to acknowledge the notion of a connection pooler. If we did, and the pooler told us "here, this session is being given back to us by the application, we'll keep it around until the next app comes along", could we clean the oldest inactive cache entries at that point? Currently they use DISCARD for that. Though this does nothing to fix hypothetical cache bloat for pg_dump in bug #14936. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote: > I wonder if this is just because we refuse to acknowledge the notion of > a connection pooler. If we did, and the pooler told us "here, this > session is being given back to us by the application, we'll keep it > around until the next app comes along", could we clean the oldest > inactive cache entries at that point? Currently they use DISCARD for > that. Though this does nothing to fix hypothetical cache bloat for > pg_dump in bug #14936. I'm not seeing how this solves anything? You don't want to throw all caches away, therefore you need a target size. Then there's also the case of the cache being too large in a single "session". Greetings, Andres Freund
Hello, Andres Freund wrote: > On 2018-03-07 08:01:38 -0300, Alvaro Herrera wrote: > > I wonder if this is just because we refuse to acknowledge the notion of > > a connection pooler. If we did, and the pooler told us "here, this > > session is being given back to us by the application, we'll keep it > > around until the next app comes along", could we clean the oldest > > inactive cache entries at that point? Currently they use DISCARD for > > that. Though this does nothing to fix hypothetical cache bloat for > > pg_dump in bug #14936. > > I'm not seeing how this solves anything? You don't want to throw all > caches away, therefore you need a target size. Then there's also the > case of the cache being too large in a single "session". Oh, I wasn't suggesting to throw away the whole cache at that point; only that that is a convenient point to do whatever cleanup we want to do. What I'm not clear about is exactly what is the cleanup that we want to do at that point. You say it should be based on some configured size; Robert says any predefined size breaks [performance for] the case where the workload uses size+1, so let's use time instead (evict anything not used in more than X seconds?), but keeping in mind that a workload that requires X+1 would also break. So it seems we've arrived at the conclusion that the only possible solution is to let the user tell us what time/size to use. But that sucks, because the user doesn't know either (maybe they can measure, but how?), and they don't even know that this setting is there to be tweaked; and if there is a performance problem, how do they figure out whether or not it can be fixed by fooling with this parameter? I mean, maybe it's set to 10 and we suggest "maybe 11 works better" but it turns out not to, so "maybe 12 works better"? How do you know when to stop increasing it? This seems a bit like max_fsm_pages, that is to say, a disaster that was only fixed by removing it. 
Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-03-07 14:48:48 -0300, Alvaro Herrera wrote: > Oh, I wasn't suggesting to throw away the whole cache at that point; > only that that is a convenient point to do whatever cleanup we want to do. But why is that better than doing so continuously? > What I'm not clear about is exactly what is the cleanup that we want to > do at that point. You say it should be based on some configured size; > Robert says any predefined size breaks [performance for] the case where > the workload uses size+1, so let's use time instead (evict anything not > used in more than X seconds?), but keeping in mind that a workload that > requires X+1 would also break. We mostly seem to have found that adding a *minimum* size before starting to evict based on time solves both of our concerns? > So it seems we've arrived at the > conclusion that the only possible solution is to let the user tell us > what time/size to use. But that sucks, because the user doesn't know > either (maybe they can measure, but how?), and they don't even know that > this setting is there to be tweaked; and if there is a performance > problem, how do they figure whether or not it can be fixed by fooling > with this parameter? I mean, maybe it's set to 10 and we suggest "maybe > 11 works better" but it turns out not to, so "maybe 12 works better"? > How do you know when to stop increasing it? I don't think it's that complicated, for the size figure. Having a knob that controls how much memory a backend uses isn't a new concept, and can definitely depend on the use case. > This seems a bit like max_fsm_pages, that is to say, a disaster that was > only fixed by removing it. I don't think that's a meaningful comparison. max_fsm_pages had a persistent effect, couldn't be tuned without restarts, and the performance dropoffs were much more "cliff" like. Greetings, Andres Freund
On Wed, Mar 7, 2018 at 6:01 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > The thing that comes to mind when reading this patch is that some time > ago we made fun of other database software, "they are so complicated to > configure, they have some magical settings that few people understand > how to set". Postgres was so much better because it was simple to set > up, no magic crap. But now it becomes apparent that that only was so > because Postgres sucked, ie., we hadn't yet gotten to the point where we > *needed* to introduce settings like that. Now we finally are? > > I have to admit being a little disappointed about that outcome. I think your disappointment is a little excessive. I am not convinced of the need either for this to have any GUCs at all, but if it makes other people happy to have them, then I think it's worth accepting that as the price of getting the feature into the tree. These are scarcely the first GUCs we have that are hard to tune. work_mem is a terrible knob, and there are probably like very few people who know how to set ssl_ecdh_curve to anything other than the default, and what's geqo_selection_bias good for, anyway? I'm not sure what makes the settings we're adding here any different. Most people will ignore them, and a few people who really care can change the values. > I wonder if this is just because we refuse to acknowledge the notion of > a connection pooler. If we did, and the pooler told us "here, this > session is being given back to us by the application, we'll keep it > around until the next app comes along", could we clean the oldest > inactive cache entries at that point? Currently they use DISCARD for > that. Though this does nothing to fix hypothetical cache bloat for > pg_dump in bug #14936. We could certainly clean the oldest inactive cache entries at that point, but there's no guarantee that would be the right thing to do. 
If the working set across all applications is small enough that you can keep them all in the caches all the time, then you should do that, for maximum performance. If not, DISCARD ALL should probably flush everything that the last application needed and the next application won't. But without some configuration knob, you have zero way of knowing how concerned the user is about saving memory in this place vs. improving performance by reducing catalog scans. Even with such a knob it's a little difficult to say which things actually ought to be thrown away. I think this is a related problem, but a different one. I also think we ought to have built-in connection pooling. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org] > The thing that comes to mind when reading this patch is that some time ago > we made fun of other database software, "they are so complicated to configure, > they have some magical settings that few people understand how to set". > Postgres was so much better because it was simple to set up, no magic crap. > But now it becomes apparent that that only was so because Postgres sucked, > ie., we hadn't yet gotten to the point where we > *needed* to introduce settings like that. Now we finally are? Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. It is estimated based on a small experiment that each backend will use several to ten GBs of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum connections are used. I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, ie. relcache, plancache, etc? Regards Takayuki Tsunakawa
Hello, At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05> > From: Alvaro Herrera [mailto:alvherre@alvh.no-ip.org] > > The thing that comes to mind when reading this patch is that some time ago > > we made fun of other database software, "they are so complicated to configure, > > they have some magical settings that few people understand how to set". > > Postgres was so much better because it was simple to set up, no magic crap. > > But now it becomes apparent that that only was so because Postgres sucked, > > ie., we hadn't yet gotten to the point where we > > *needed* to introduce settings like that. Now we finally are? > > Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. It is estimated based on a small experiment that each backend will use several to ten GBs of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum connections are used. > > I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, ie. relcache, plancache, etc? This works only for syscaches, which could bloat with entries for nonexistent objects. Plan cache is an utterly different thing. It is abandoned at the end of a transaction or such like. Relcache is not based on catcache and out of the scope of this patch since it doesn't get bloated with nonexistent entries. It uses dynahash and we could introduce a similar feature to it if we are willing to cap relcache size. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: > At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05> >> Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. It is estimated based on a small experiment that each backend will use several to ten GBs of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum connections are used. >> >> I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, ie. relcache, plancache, etc? > This works only for syscaches, which could bloat with entries for > nonexistent objects. > Plan cache is an utterly different thing. It is abandoned at the > end of a transaction or such like. When I was at Salesforce, we had *substantial* problems with plancache bloat. The driving factor there was plans associated with plpgsql functions, which Salesforce had a huge number of. In an environment like that, there would be substantial value in being able to prune both the plancache and plpgsql's function cache. (Note that neither of those things are "abandoned at the end of a transaction".) > Relcache is not based on catcache and out of the scope of this > patch since it doesn't get bloated with nonexistent entries. It > uses dynahash and we could introduce a similar feature to it if > we are willing to cap relcache size. I think if the case of concern is an application with 200,000 tables, it's just nonsense to claim that relcache size isn't an issue. In short, it's not really apparent to me that negative syscache entries are the major problem of this kind. I'm afraid that you're drawing very large conclusions from a specific workload. Maybe we could fix that workload some other way. regards, tom lane
At Wed, 07 Mar 2018 23:12:29 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <352.1520482349@sss.pgh.pa.us> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes: > > At Thu, 8 Mar 2018 00:28:04 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05> > >> Yes. We are now facing the problem of too much memory use by PostgreSQL, where some applications randomly access about 200,000 tables. It is estimated based on a small experiment that each backend will use several to ten GBs of local memory for CacheMemoryContext. The total memory use will become over 1 TB when the expected maximum connections are used. > >> > >> I haven't looked at this patch, but does it evict all kinds of entries in CacheMemoryContext, ie. relcache, plancache, etc? > > > This works only for syscaches, which could bloat with entries for > > nonexistent objects. > > > Plan cache is an utterly different thing. It is abandoned at the > > end of a transaction or such like. > > When I was at Salesforce, we had *substantial* problems with plancache > bloat. The driving factor there was plans associated with plpgsql > functions, which Salesforce had a huge number of. In an environment > like that, there would be substantial value in being able to prune > both the plancache and plpgsql's function cache. (Note that neither > of those things are "abandoned at the end of a transaction".) Mmm. Right. Thanks for pointing it out. Anyway plan cache seems to be a different thing. > > Relcache is not based on catcache and out of the scope of this > > patch since it doesn't get bloated with nonexistent entries. It > > uses dynahash and we could introduce a similar feature to it if > > we are willing to cap relcache size. > > I think if the case of concern is an application with 200,000 tables, > it's just nonsense to claim that relcache size isn't an issue. 
> > In short, it's not really apparent to me that negative syscache entries > are the major problem of this kind. I'm afraid that you're drawing very > large conclusions from a specific workload. Maybe we could fix that > workload some other way. The current patch doesn't consider whether an entry is negative or positive(?). It just cleans up all entries based on time. If relcache has to have the same characteristics as syscaches, it might be better to put it on the catcache mechanism, instead of adding the same pruning mechanism to dynahash. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 09 Mar 2018 17:40:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180309.174001.202113825.horiguchi.kyotaro@lab.ntt.co.jp> > > In short, it's not really apparent to me that negative syscache entries > > are the major problem of this kind. I'm afraid that you're drawing very > > large conclusions from a specific workload. Maybe we could fix that > > workload some other way. > > The current patch doesn't consider whether an entry is negative > or positive(?). It just cleans up all entries based on time. > > If relcache has to have the same characteristics as syscaches, it > might be better to put it on the catcache mechanism, instead of adding > the same pruning mechanism to dynahash. For the moment, I added such a feature to dynahash and let only relcache use it in this patch. Hash elements have a different shape in a "prunable" hash, and pruning is performed in a similar way, sharing the settings with syscache. This seems to work fine. It is a bit uneasy that all syscaches and relcache share the same value of syscache_memory_target... Something like the attached test script causes relcache "bloat". The server emits the following log entries at DEBUG1 message level. DEBUG: removed 11240/32769 entries from hash "Relcache by OID" at character 15 # The last few words are just garbage I mentioned in another thread. The last two patches do that (as PoC). regards, -- Kyotaro Horiguchi NTT Open Source Software Center From 705b67a79ef7e27a450083944f8d970b7eb9e619 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 26 Dec 2017 17:43:09 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time before enlarging the hash array. 
--- doc/src/sgml/config.sgml | 38 +++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 152 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 19 ++++ 6 files changed, 233 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3a8fc7d803..394e0703f8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1557,6 +1557,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + a certain amount of intermittently used syscache entries, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum time in seconds for which a syscache entry must + be unused before it becomes a candidate for removal. -1 disables + syscache pruning entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that are not + used for this duration can be removed to prevent syscache bloat. 
This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index dbaaf8e005..86d76917bb 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,6 +733,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..0236a05127 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,23 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * Let the name be the same as the GUC variable name, not using 'catcache'. + */ +int syscache_memory_target = 0; + +/* GUC variable to define the minimum age in seconds of entries that will be + * considered for eviction. Ditto for the name. + */ +int syscache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +880,130 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not yet done */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. 
We remove them if + * they not accessed for a certain time to prevent catcache from bloating. The + * eviction is performed with the similar algorithm with buffer eviction using + * access counter. Entries that are accessed several times can live longer + * than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purpose */ + int ntotal = 0; + /* + * nth element in nentries stores the number of cache entries that have + * lived unaccessed for corresponding multiple in ageclass of + * syscache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (syscache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the area for bucket array is dominant, consider only it. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < (Size) syscache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. 
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > syscache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than syscache_prune_min_age + * seconds. Entries that are not accessed after the last pruning + * are removed after that many seconds, and those that have been + * accessed several times are removed after being left alone for + * up to three times that duration. We don't try to shrink buckets + * since pruning effectively caps catcache expansion in the long + * term. + */ + if (entry_age > syscache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * syscache_prune_min_age, nentries[0], + ageclass[1] * syscache_prune_min_age, nentries[1], + ageclass[2] * syscache_prune_min_age, nentries[2], + ageclass[3] * syscache_prune_min_age, nentries[3], + ageclass[4] * syscache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. 
If it's * negative, we can report failure to the caller. @@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. 
*/ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index a4f9b3668e..5e0d18657f 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -78,6 +78,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -1972,6 +1973,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Syscache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &syscache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum idle duration before an unused syscache entry is removed."), + gettext_noop("Syscache entries that stay unused for longer than this many seconds are considered for removal."), + GUC_UNIT_S + }, + &syscache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 39272925fb..5a5729a88f 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -124,6 +124,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#syscache_memory_target = 0kB # in kB. 
zero disables the feature +#syscache_prune_min_age = 600s # -1 disables the feature #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..c3c4d65998 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int syscache_prune_min_age; +extern int syscache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.2 From 74545dc6f52d42cf93d1353e205bb38581269c5f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 15:52:18 +0900 Subject: [PATCH 2/3] introduce dynahash pruning --- src/backend/utils/hash/dynahash.c | 159 +++++++++++++++++++++++++++++++++----- src/include/utils/catcache.h | 12 +++ src/include/utils/hsearch.h | 19 ++++- 3 files changed, 170 insertions(+), 20 deletions(-) diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 5281cd5410..5a8b15652a 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -88,6 +88,7 @@ #include "access/xact.h" #include "storage/shmem.h" #include "storage/spin.h" +#include "utils/catcache.h" #include "utils/dynahash.h" #include "utils/memutils.h" @@ -184,6 +185,8 @@ struct HASHHDR long ssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc; /* number of entries to allocate at once */ + bool prunable; /* true if prunable */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ #ifdef HASH_STATISTICS @@ -227,16 +230,18 @@ struct HTAB int sshift; /* segment shift = log2(ssize) */ }; +#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? 
sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT)) + /* * Key (also entry) part of a HASHELEMENT */ -#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT))) +#define ELEMENTKEY(helem, ctlp) (((char *)(helem)) + HASHELEMENT_SIZE(ctlp)) /* * Obtain element pointer given pointer to key */ -#define ELEMENT_FROM_KEY(key) \ - ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT)))) +#define ELEMENT_FROM_KEY(key, ctlp) \ + ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp))) /* * Fast MOD arithmetic, assuming that y is a power of 2 ! @@ -257,6 +262,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp); static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx); static bool dir_realloc(HTAB *hashp); static bool expand_table(HTAB *hashp); +static bool prune_entries(HTAB *hashp); static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx); static void hdefault(HTAB *hashp); static int choose_nelem_alloc(Size entrysize); @@ -497,6 +503,17 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->entrysize = info->entrysize; } + /* + * hash table runs pruning + */ + if (flags & HASH_PRUNABLE) + { + hctl->prunable = true; + hctl->prune_cb = info->prune_cb; + } + else + hctl->prunable = false; + /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; hashp->ssize = hctl->ssize; @@ -982,7 +999,7 @@ hash_search_with_hash_value(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == hashvalue && - match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -995,6 +1012,17 @@ hash_search_with_hash_value(HTAB *hashp, if (foundPtr) *foundPtr = (bool) (currBucket != NULL); + /* Update access counter if needed */ + if (hctl->prunable && currBucket && + (action == HASH_FIND || action == HASH_ENTER)) + { + PRUNABLE_HASHELEMENT *prunable_elm = + 
(PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * OK, now what? */ @@ -1002,7 +1030,8 @@ hash_search_with_hash_value(HTAB *hashp, { case HASH_FIND: if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); + return NULL; case HASH_REMOVE: @@ -1031,7 +1060,7 @@ hash_search_with_hash_value(HTAB *hashp, * element, because someone else is going to reuse it the next * time something is added to the table */ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } return NULL; @@ -1043,7 +1072,7 @@ hash_search_with_hash_value(HTAB *hashp, case HASH_ENTER: /* Return existing element if found, else create one */ if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); /* disallow inserts if frozen */ if (hashp->frozen) @@ -1073,8 +1102,18 @@ hash_search_with_hash_value(HTAB *hashp, /* copy key into record */ currBucket->hashvalue = hashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize); + /* set access counter */ + if (hctl->prunable) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * Caller is expected to fill the data field on return. DO NOT * insert any code that could possibly throw error here, as doing @@ -1082,7 +1121,7 @@ hash_search_with_hash_value(HTAB *hashp, * caller's data structure. 
*/ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } elog(ERROR, "unrecognized hash action code: %d", (int) action); @@ -1114,7 +1153,7 @@ hash_update_hash_key(HTAB *hashp, void *existingEntry, const void *newKeyPtr) { - HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry); + HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl); HASHHDR *hctl = hashp->hctl; uint32 newhashvalue; Size keysize; @@ -1198,7 +1237,7 @@ hash_update_hash_key(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == newhashvalue && - match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -1232,7 +1271,7 @@ hash_update_hash_key(HTAB *hashp, /* copy new key into record */ currBucket->hashvalue = newhashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize); /* rest of record is untouched */ @@ -1386,8 +1425,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) void * hash_seq_search(HASH_SEQ_STATUS *status) { - HTAB *hashp; - HASHHDR *hctl; + HTAB *hashp = status->hashp; + HASHHDR *hctl = hashp->hctl; uint32 max_bucket; long ssize; long segment_num; @@ -1402,15 +1441,13 @@ hash_seq_search(HASH_SEQ_STATUS *status) status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ ++status->curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } /* * Search for next nonempty bucket starting at curBucket. 
*/ curBucket = status->curBucket; - hashp = status->hashp; - hctl = hashp->hctl; ssize = hashp->ssize; max_bucket = hctl->max_bucket; @@ -1456,7 +1493,7 @@ hash_seq_search(HASH_SEQ_STATUS *status) if (status->curEntry == NULL) /* end of this bucket */ ++curBucket; status->curBucket = curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } void @@ -1550,6 +1587,10 @@ expand_table(HTAB *hashp) */ if ((uint32) new_bucket > hctl->high_mask) { + /* try pruning before expansion. return true on success */ + if (hctl->prunable && prune_entries(hashp)) + return true; + hctl->low_mask = hctl->high_mask; hctl->high_mask = (uint32) new_bucket | hctl->low_mask; } @@ -1592,6 +1633,86 @@ expand_table(HTAB *hashp) return true; } +static bool +prune_entries(HTAB *hashp) +{ + HASHHDR *hctl = hashp->hctl; + HASH_SEQ_STATUS status; + void *elm; + TimestampTz currclock = GetCatCacheClock(); + int nall = 0, + nremoved = 0; + + Assert(hctl->prunable); + + /* not called for frozen or under seqscan. see + * hash_search_with_hash_value. */ + Assert(IS_PARTITIONED(hctl) || + hashp->frozen || + hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) < + hctl->ffactor || + has_seq_scans(hashp)); + + /* This setting prevents pruning */ + if (syscache_prune_min_age < 0) + return false; + + /* + * return false immediately if this hash is small enough. We only consider + * bucket array size since it is the significant part of memory usage. + * settings is shared with syscache + */ + if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize < + (Size) syscache_memory_target * 1024L) + return false; + + /* + * Ok, this hash can be pruned. start pruning. This function is called + * early enough for doing this via public API. 
+ */ + hash_seq_init(&status, hashp); + while ((elm = hash_seq_search(&status)) != NULL) + { + PRUNABLE_HASHELEMENT *helm = + (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl); + long entry_age; + int us; + + nall++; + + TimestampDifference(helm->last_access, currclock, &entry_age, &us); + + /* settings is shared with syscache */ + if (entry_age > syscache_prune_min_age) + { + /* Wait for the next chance if this is recently used */ + if (helm->naccess > 0) + helm->naccess--; + else + { + /* just call it if callback is provided, remove otherwise */ + if (hctl->prune_cb) + { + if (hctl->prune_cb(hashp, (void *)elm)) + nremoved++; + } + else + { + bool found; + + hash_search(hashp, elm, HASH_REMOVE, &found); + Assert(found); + nremoved++; + } + } + } + } + + elog(DEBUG1, "removed %d/%d entries from hash \"%s\"", + nremoved, nall, hashp->tabname); + + return nremoved > 0; +} static bool dir_realloc(HTAB *hashp) @@ -1665,7 +1786,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx) return false; /* Each element has a HASHELEMENT header plus user data. */ - elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize); + elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize); CurrentDynaHashCxt = hashp->hcxt; firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index c3c4d65998..fcc680bb82 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts) catcacheclock = ts; } +/* + * GetCatCacheClock - get timestamp for catcache access record + * + * This clock is basically provided for catcache usage, but dynahash has a + * similar pruning mechanism and wants to use the same clock. 
+ */ +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index 8357faac5a..df12352a46 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -13,7 +13,7 @@ */ #ifndef HSEARCH_H #define HSEARCH_H - +#include "datatype/timestamp.h" /* * Hash functions must have this signature. @@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request); * HASHELEMENT is the private part of a hashtable entry. The caller's data * follows the HASHELEMENT structure (on a MAXALIGN'd boundary). The hash key * is expected to be at the start of the caller's hash entry data structure. + * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead. */ typedef struct HASHELEMENT { @@ -54,12 +55,26 @@ typedef struct HASHELEMENT uint32 hashvalue; /* hash function result for this entry */ } HASHELEMENT; +typedef struct PRUNABLE_HASHELEMENT +{ + struct HASHELEMENT *link; /* link to next entry in same bucket */ + uint32 hashvalue; /* hash function result for this entry */ + TimestampTz last_access; /* timestamp of the last usage */ + int naccess; /* takes 0 to 2, counted up when used */ +} PRUNABLE_HASHELEMENT; + /* Hash table header struct is an opaque type known only within dynahash.c */ typedef struct HASHHDR HASHHDR; /* Hash table control struct is an opaque type known only within dynahash.c */ typedef struct HTAB HTAB; +/* + * Hash pruning callback. This is called for entries that are about to be + * removed without the owner's intention.
+ */ +typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent); + /* Parameter data structure for hash_create */ /* Only those fields indicated by hash_flags need be set */ typedef struct HASHCTL @@ -77,6 +92,7 @@ typedef struct HASHCTL HashAllocFunc alloc; /* memory allocator */ MemoryContext hcxt; /* memory context to use for allocations */ HASHHDR *hctl; /* location of header in shared mem */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ } HASHCTL; /* Flags to indicate which parameters are supplied */ @@ -94,6 +110,7 @@ typedef struct HASHCTL #define HASH_SHARED_MEM 0x0800 /* Hashtable is in shared memory */ #define HASH_ATTACH 0x1000 /* Do not initialize hctl */ #define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */ +#define HASH_PRUNABLE 0x4000 /* pruning setting */ /* max_dsize value to indicate expansible directory */ -- 2.16.2 From debface28e2261b0d819c46e52942ba500143581 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 17:31:43 +0900 Subject: [PATCH 3/3] Apply pruning to relcache --- src/backend/utils/cache/relcache.c | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 9ee78f885f..f344771d57 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3503,6 +3503,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence, #define INITRELCACHESIZE 400 +/* callback function for hash pruning */ +static bool +relcache_prune_cb(HTAB *hashp, void *ent) +{ + RelIdCacheEnt *relent = (RelIdCacheEnt *) ent; + Relation relation; + + /* this relation is requested to be removed. */ + RelationIdCacheLookup(relent->reloid, relation); + + /* but we cannot remove an active cache entry */ + if (!RelationHasReferenceCountZero(relation)) + return false; + + /* + * Otherwise we are allowed to forget it unconditionally.
see + * RelationForgetRelation + */ + RelationClearRelation(relation, false); + + return true; +} + void RelationCacheInitialize(void) { @@ -3520,8 +3543,11 @@ RelationCacheInitialize(void) MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); + + /* use the same settings as syscache */ + ctl.prune_cb = relcache_prune_cb; RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE, - &ctl, HASH_ELEM | HASH_BLOBS); + &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE); /* * relation mapper needs to be initialized too -- 2.16.2
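(Editorial aside: the eviction policy shared by the patches above is a clock-sweep-like scheme — an access counter that saturates at 2, is decremented once per pruning pass, and permits removal only once it reaches zero with the entry idle longer than the minimum age. The following standalone Python model is purely illustrative, with invented names and none of the PostgreSQL specifics:)

```python
import time

PRUNE_MIN_AGE = 600  # seconds; plays the role of syscache_prune_min_age


class Entry:
    def __init__(self, key):
        self.key = key
        self.naccess = 0                     # saturates at 2, like ct->naccess
        self.last_access = time.monotonic()  # stand-in for the catcache clock


class PrunableCache:
    def __init__(self):
        self.entries = {}

    def lookup(self, key):
        # Create on miss, then bump the access counter and the timestamp.
        e = self.entries.get(key)
        if e is None:
            e = self.entries[key] = Entry(key)
        if e.naccess < 2:
            e.naccess += 1
        e.last_access = time.monotonic()
        return e

    def prune(self, now=None, min_age=PRUNE_MIN_AGE):
        # Entries idle longer than min_age lose one counter tick per pass;
        # an entry is removed only once its counter has decayed to zero, so
        # a heavily used entry survives up to three passes after going idle.
        if now is None:
            now = time.monotonic()
        removed = 0
        for key, e in list(self.entries.items()):
            if now - e.last_access > min_age:
                if e.naccess > 0:
                    e.naccess -= 1
                else:
                    del self.entries[key]
                    removed += 1
        return removed
```

Roughly, lookup() corresponds to the naccess/last_access updates in SearchCatCacheInternal() and hash_search_with_hash_value(), and prune() to CatCacheCleanupOldEntries() and prune_entries(); the memory-target check and the prune callback are omitted here.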
Oops. At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
> Something like the attached test script causes relcache

This is that.

#! /usr/bin/perl
printf("drop schema if exists test_schema;\n");
printf("create schema test_schema;\n");
for $i (0..100000) {
  printf("create table test_schema.t%06d ();\n", $i);
}
printf("set syscache_memory_target = '1kB';\n");
printf("set syscache_prune_min_age = '15s';\n");
for $i (0..100000) {
  printf("select * from test_schema.t%06d;\n", $i);
}
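(Editorial aside, as context for the plancache discussion in the next mail: the idea there is to keep the indispensable CachedPlanSource while discarding only its rebuildable generic plan once the entry goes unused. A standalone sketch of that shape — invented names, a logical clock instead of timestamps, not PostgreSQL code; min_cached_plans and plancache_prune_min_age are only mimicked by the parameters:)

```python
import itertools

_clock = itertools.count(1)  # logical clock standing in for timestamps


def now():
    return next(_clock)


def build_plan(query):
    return ("PLAN", query)  # stand-in for the expensive planner


class CachedPlanSource:
    """Toy model: the query source must be kept for the lifetime of the
    prepared statement, while the generic plan may be dropped because it
    can be rebuilt on demand."""

    def __init__(self, query):
        self.query = query
        self.generic_plan = None
        self.last_use = now()

    def get_plan(self):
        if self.generic_plan is None:
            self.generic_plan = build_plan(self.query)  # rebuild as necessary
        self.last_use = now()
        return self.generic_plan


def prune_generic_plans(sources, min_age, min_cached_plans, current):
    """Drop generic plans of sources idle for more than min_age clock ticks,
    but leave the min_cached_plans most recently used sources untouched."""
    by_recency = sorted(sources, key=lambda s: s.last_use, reverse=True)
    dropped = 0
    for s in by_recency[min_cached_plans:]:
        if s.generic_plan is not None and current - s.last_use > min_age:
            s.generic_plan = None  # the CachedPlanSource itself stays
            dropped += 1
    return dropped
```

The point of the model is only that pruning shrinks memory without breaking prepared statements: a pruned source transparently replans on its next use.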
At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
> > > In short, it's not really apparent to me that negative syscache entries
> > > are the major problem of this kind. I'm afraid that you're drawing very
> > > large conclusions from a specific workload. Maybe we could fix that
> > > workload some other way.
> >
> > The current patch doesn't consider whether an entry is negative
> > or positive(?). It just cleans up all entries based on time.
> >
> > If relcache has to have the same characteristics as syscaches, it
> > might be better built on the catcache mechanism, instead of adding
> > the same pruning mechanism to dynahash..
>
> For the moment, I added such a feature to dynahash and let only
> relcache use it in this patch. A hash element has a different shape
> in a "prunable" hash, and pruning is performed in a similar way,
> sharing the settings with syscache. This seems to be working fine.

I gave some consideration to plancache. Its most notable difference from catcache and relcache is that its entries are not voluntarily removable, since CachedPlanSource, the root struct of a plan cache, holds some indispensable information. As for prepared queries, even if we stored that information in another location, for example in the "Prepared Queries" hash, that would merely move a big chunk of data to another place.

Looking into CachedPlanSource, the generic plan is the part that is safely removable, since it is rebuilt as necessary. Keeping "old" plancache entries without a generic plan can reduce memory usage. For testing purposes, I made 50000 prepared statements like "select sum(c) from p where e < $" on 100 partitions. With the feature disabled (0004 patch), VSZ of the backend exceeds 3GB (it is still increasing at the moment), while it stops increasing at about 997MB with min_cached_plans = 1000 and plancache_prune_min_age = '10s'.
# 10s is apparently too short for actual use, of course.

It is expected to save a significant amount of memory if the plans are large enough, but I'm still not sure whether it is worth doing, or whether this is the right way. The attached is the patch set including this plancache stuff.

0001- catcache time-based expiration (the origin of this thread)
0002- introduces dynahash pruning feature
0003- implements relcache pruning using 0002
0004- (perhaps) independent from the three above. PoC of plancache pruning. Details are shown above.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 705b67a79ef7e27a450083944f8d970b7eb9e619 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 26 Dec 2017 17:43:09 +0900
Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, removal of entries that haven't been used for a certain time is considered before enlarging the hash array.
--- doc/src/sgml/config.sgml | 38 +++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 152 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 19 ++++ 6 files changed, 233 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 3a8fc7d803..394e0703f8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1557,6 +1557,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + a certain amount of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds after which a + syscache entry is considered for removal. -1 indicates that syscache + pruning is disabled altogether. The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that are not + used for that duration can be removed to prevent syscache bloat.
This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index dbaaf8e005..86d76917bb 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,6 +733,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..0236a05127 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,23 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * Let the name be the same as the GUC variable name, not using 'catcache'. + */ +int syscache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be considered + * to be evicted, in seconds. Ditto for the name. + */ +int syscache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +880,130 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons.
We remove them if + * they are not accessed for a certain time, to prevent catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * syscache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (syscache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the area for the bucket array is dominant, consider only it. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < (Size) syscache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned.
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > syscache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than syscache_prune_min_age + * seconds. Entries that are not accessed after last pruning are + * removed in that seconds, and that has been accessed several + * times are removed after leaving alone for up to three times of + * the duration. We don't try shrink buckets since pruning + * effectively caps catcache expansion in the long term. + */ + if (entry_age > syscache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * syscache_prune_min_age, nentries[0], + ageclass[1] * syscache_prune_min_age, nentries[1], + ageclass[2] * syscache_prune_min_age, nentries[2], + ageclass[3] * syscache_prune_min_age, nentries[3], + ageclass[4] * syscache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1417,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. 
If it's * negative, we can report failure to the caller. @@ -1813,7 +1953,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2045,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2054,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make a room for the new entry. If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. 
*/ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index a4f9b3668e..5e0d18657f 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -78,6 +78,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -1972,6 +1973,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"syscache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Syscache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &syscache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"syscache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum duration of an unused syscache entry to remove."), + gettext_noop("Syscache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &syscache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 39272925fb..5a5729a88f 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -124,6 +124,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#syscache_memory_target = 0kB # in kB. 
zero disables the feature +#syscache_prune_min_age = 600s # -1 disables the feature #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..c3c4d65998 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int syscache_prune_min_age; +extern int syscache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.2 From 037f3534f5274eb7bcdb5adee262b5af624175e2 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 15:52:18 +0900 Subject: [PATCH 2/4] introduce dynahash pruning --- src/backend/utils/hash/dynahash.c | 169 +++++++++++++++++++++++++++++++++----- src/include/utils/catcache.h | 12 +++ src/include/utils/hsearch.h | 21 ++++- 3 files changed, 182 insertions(+), 20 deletions(-) diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 5281cd5410..a5b4979662 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -88,6 +88,7 @@ #include "access/xact.h" #include "storage/shmem.h" #include "storage/spin.h" +#include "utils/catcache.h" #include "utils/dynahash.h" #include "utils/memutils.h" @@ -184,6 +185,10 @@ struct HASHHDR long ssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc; /* number of entries to allocate at once */ + bool prunable; /* true if prunable */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ #ifdef HASH_STATISTICS @@ -227,16 +232,18 @@ struct HTAB int sshift; /* segment shift = log2(ssize) */ }; +#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ?
sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT)) + /* * Key (also entry) part of a HASHELEMENT */ -#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT))) +#define ELEMENTKEY(helem, ctlp) (((char *)(helem)) + HASHELEMENT_SIZE(ctlp)) /* * Obtain element pointer given pointer to key */ -#define ELEMENT_FROM_KEY(key) \ - ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT)))) +#define ELEMENT_FROM_KEY(key, ctlp) \ + ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp))) /* * Fast MOD arithmetic, assuming that y is a power of 2 ! @@ -257,6 +264,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp); static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx); static bool dir_realloc(HTAB *hashp); static bool expand_table(HTAB *hashp); +static bool prune_entries(HTAB *hashp); static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx); static void hdefault(HTAB *hashp); static int choose_nelem_alloc(Size entrysize); @@ -497,6 +505,25 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->entrysize = info->entrysize; } + /* + * hash table runs pruning + */ + if (flags & HASH_PRUNABLE) + { + hctl->prunable = true; + hctl->prune_cb = info->prune_cb; + if (info->memory_target) + hctl->memory_target = info->memory_target; + else + hctl->memory_target = &syscache_memory_target; + if (info->prune_min_age) + hctl->prune_min_age = info->prune_min_age; + else + hctl->prune_min_age = &syscache_prune_min_age; + } + else + hctl->prunable = false; + /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; hashp->ssize = hctl->ssize; @@ -982,7 +1009,7 @@ hash_search_with_hash_value(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == hashvalue && - match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -995,6 +1022,17 @@ 
hash_search_with_hash_value(HTAB *hashp, if (foundPtr) *foundPtr = (bool) (currBucket != NULL); + /* Update access counter if needed */ + if (hctl->prunable && currBucket && + (action == HASH_FIND || action == HASH_ENTER)) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * OK, now what? */ @@ -1002,7 +1040,8 @@ hash_search_with_hash_value(HTAB *hashp, { case HASH_FIND: if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); + return NULL; case HASH_REMOVE: @@ -1031,7 +1070,7 @@ hash_search_with_hash_value(HTAB *hashp, * element, because someone else is going to reuse it the next * time something is added to the table */ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } return NULL; @@ -1043,7 +1082,7 @@ hash_search_with_hash_value(HTAB *hashp, case HASH_ENTER: /* Return existing element if found, else create one */ if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); /* disallow inserts if frozen */ if (hashp->frozen) @@ -1073,8 +1112,18 @@ hash_search_with_hash_value(HTAB *hashp, /* copy key into record */ currBucket->hashvalue = hashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize); + /* set access counter */ + if (hctl->prunable) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * Caller is expected to fill the data field on return. DO NOT * insert any code that could possibly throw error here, as doing @@ -1082,7 +1131,7 @@ hash_search_with_hash_value(HTAB *hashp, * caller's data structure. 
*/ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } elog(ERROR, "unrecognized hash action code: %d", (int) action); @@ -1114,7 +1163,7 @@ hash_update_hash_key(HTAB *hashp, void *existingEntry, const void *newKeyPtr) { - HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry); + HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl); HASHHDR *hctl = hashp->hctl; uint32 newhashvalue; Size keysize; @@ -1198,7 +1247,7 @@ hash_update_hash_key(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == newhashvalue && - match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -1232,7 +1281,7 @@ hash_update_hash_key(HTAB *hashp, /* copy new key into record */ currBucket->hashvalue = newhashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize); /* rest of record is untouched */ @@ -1386,8 +1435,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) void * hash_seq_search(HASH_SEQ_STATUS *status) { - HTAB *hashp; - HASHHDR *hctl; + HTAB *hashp = status->hashp; + HASHHDR *hctl = hashp->hctl; uint32 max_bucket; long ssize; long segment_num; @@ -1402,15 +1451,13 @@ hash_seq_search(HASH_SEQ_STATUS *status) status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ ++status->curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } /* * Search for next nonempty bucket starting at curBucket. 
*/ curBucket = status->curBucket; - hashp = status->hashp; - hctl = hashp->hctl; ssize = hashp->ssize; max_bucket = hctl->max_bucket; @@ -1456,7 +1503,7 @@ hash_seq_search(HASH_SEQ_STATUS *status) if (status->curEntry == NULL) /* end of this bucket */ ++curBucket; status->curBucket = curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } void @@ -1550,6 +1597,10 @@ expand_table(HTAB *hashp) */ if ((uint32) new_bucket > hctl->high_mask) { + /* try pruning before expansion. return true on success */ + if (hctl->prunable && prune_entries(hashp)) + return true; + hctl->low_mask = hctl->high_mask; hctl->high_mask = (uint32) new_bucket | hctl->low_mask; } @@ -1592,6 +1643,86 @@ expand_table(HTAB *hashp) return true; } +static bool +prune_entries(HTAB *hashp) +{ + HASHHDR *hctl = hashp->hctl; + HASH_SEQ_STATUS status; + void *elm; + TimestampTz currclock = GetCatCacheClock(); + int nall = 0, + nremoved = 0; + + Assert(hctl->prunable); + + /* not called for frozen or under seqscan. see + * hash_search_with_hash_value. */ + Assert(IS_PARTITIONED(hctl) || + hashp->frozen || + hctl->freeList[0].nentries / (long) (hctl->max_bucket + 1) < + hctl->ffactor || + has_seq_scans(hashp)); + + /* This setting prevents pruning */ + if (*hctl->prune_min_age < 0) + return false; + + /* + * return false immediately if this hash is small enough. We only consider + * bucket array size since it is the significant part of memory usage. + * settings is shared with syscache + */ + if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize < + (Size) *hctl->memory_target * 1024L) + return false; + + /* + * Ok, this hash can be pruned. start pruning. This function is called + * early enough for doing this via public API. 
+ */ + hash_seq_init(&status, hashp); + while ((elm = hash_seq_search(&status)) != NULL) + { + PRUNABLE_HASHELEMENT *helm = + (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl); + long entry_age; + int us; + + nall++; + + TimestampDifference(helm->last_access, currclock, &entry_age, &us); + + /* settings is shared with syscache */ + if (entry_age > *hctl->prune_min_age) + { + /* Wait for the next chance if this is recently used */ + if (helm->naccess > 0) + helm->naccess--; + else + { + /* just call it if callback is provided, remove otherwise */ + if (hctl->prune_cb) + { + if (hctl->prune_cb(hashp, (void *)elm)) + nremoved++; + } + else + { + bool found; + + hash_search(hashp, elm, HASH_REMOVE, &found); + Assert(found); + nremoved++; + } + } + } + } + + elog(DEBUG1, "removed %d/%d entries from hash \"%s\"", + nremoved, nall, hashp->tabname); + + return nremoved > 0; +} static bool dir_realloc(HTAB *hashp) @@ -1665,7 +1796,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx) return false; /* Each element has a HASHELEMENT header plus user data. */ - elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize); + elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize); CurrentDynaHashCxt = hashp->hcxt; firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index c3c4d65998..fcc680bb82 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts) catcacheclock = ts; } +/* + * GetCatCacheClock - get timestamp for catcache access record + * + * This clock is basically provided for catcache usage, but dynahash has a + * similar pruning mechanism and wants to use the same clock. 
+ */ +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index 8357faac5a..7ea3c75423 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -13,7 +13,7 @@ */ #ifndef HSEARCH_H #define HSEARCH_H - +#include "datatype/timestamp.h" /* * Hash functions must have this signature. @@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request); * HASHELEMENT is the private part of a hashtable entry. The caller's data * follows the HASHELEMENT structure (on a MAXALIGN'd boundary). The hash key * is expected to be at the start of the caller's hash entry data structure. + * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead. */ typedef struct HASHELEMENT { @@ -54,12 +55,26 @@ typedef struct HASHELEMENT uint32 hashvalue; /* hash function result for this entry */ } HASHELEMENT; +typedef struct PRUNABLE_HASHELEMENT +{ + struct HASHELEMENT *link; /* link to next entry in same bucket */ + uint32 hashvalue; /* hash function result for this entry */ + TimestampTz last_access; /* timestamp of the last usage */ + int naccess; /* takes 0 to 2, counted up when used */ +} PRUNABLE_HASHELEMENT; + /* Hash table header struct is an opaque type known only within dynahash.c */ typedef struct HASHHDR HASHHDR; /* Hash table control struct is an opaque type known only within dynahash.c */ typedef struct HTAB HTAB; +/* + * Hash pruning callback. This is called for the entries which is about to be + * removed without the owner's intention. 
+ */ +typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent); + /* Parameter data structure for hash_create */ /* Only those fields indicated by hash_flags need be set */ typedef struct HASHCTL @@ -77,6 +92,9 @@ typedef struct HASHCTL HashAllocFunc alloc; /* memory allocator */ MemoryContext hcxt; /* memory context to use for allocations */ HASHHDR *hctl; /* location of header in shared mem */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ } HASHCTL; /* Flags to indicate which parameters are supplied */ @@ -94,6 +112,7 @@ typedef struct HASHCTL #define HASH_SHARED_MEM 0x0800 /* Hashtable is in shared memory */ #define HASH_ATTACH 0x1000 /* Do not initialize hctl */ #define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */ +#define HASH_PRUNABLE 0x4000 /* pruning setting */ /* max_dsize value to indicate expansible directory */ -- 2.16.2 From 94c85baed46e1a8330af7d664c44289d97d6df26 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 17:31:43 +0900 Subject: [PATCH 3/4] Apply pruning to relcache --- src/backend/utils/cache/relcache.c | 28 +++++++++++++++++++++++++++- 1 file changed, 27 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 9ee78f885f..da9ecee15b 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3503,6 +3503,29 @@ RelationSetNewRelfilenode(Relation relation, char persistence, #define INITRELCACHESIZE 400 +/* callback function for hash pruning */ +static bool +relcache_prune_cb(HTAB *hashp, void *ent) +{ + RelIdCacheEnt *relent = (RelIdCacheEnt *) ent; + Relation relation; + + /* this relation is requested to be removed.
*/ + RelationIdCacheLookup(relent->reloid, relation); + + /* but cannot remove cache entries currently in use */ + if (!RelationHasReferenceCountZero(relation)) + return false; + + /* + * Otherwise we are allowed to forget it unconditionally. See + * RelationForgetRelation. + */ + RelationClearRelation(relation, false); + + return true; +} + void RelationCacheInitialize(void) { @@ -3520,8 +3543,11 @@ RelationCacheInitialize(void) MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); + + /* use the same setting with syscache */ + ctl.prune_cb = relcache_prune_cb; RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE, - &ctl, HASH_ELEM | HASH_BLOBS); + &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE); /* * relation mapper needs to be initialized too -- 2.16.2 From 89bb807c11ec411d1e25b0aa03792ae341435fec Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 13 Mar 2018 17:29:32 +0900 Subject: [PATCH 4/4] PoC of generic plan removal of plancachesource.
--- src/backend/utils/cache/plancache.c | 157 ++++++++++++++++++++++++++++++++++++ src/backend/utils/hash/dynahash.c | 16 +++- src/backend/utils/misc/guc.c | 21 +++++ src/backend/utils/mmgr/mcxt.c | 1 + src/include/commands/prepare.h | 4 + src/include/utils/hsearch.h | 2 + src/include/utils/plancache.h | 14 +++- 7 files changed, 208 insertions(+), 7 deletions(-) diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 8d7d8e04c9..9e34e4098e 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,13 @@ * guarantee to save a CachedPlanSource without error. */ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; +int plancache_prune_min_age = 600; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +114,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* @@ -207,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -422,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource 
*plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void +MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan != plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -469,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. */ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -491,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -501,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -536,6 +588,11 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + if 
(plansource->is_saved) + { + Assert (num_saved_plans >= 1); + num_saved_plans--; + } } } @@ -1146,6 +1203,15 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* increment access counter and set timestamp */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the first of the list if needed */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1154,6 +1220,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved && + (min_cached_plans < 0 || num_saved_plans > min_cached_plans)) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1166,6 +1237,12 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + + /* Prune cached plans if needed */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1853,6 +1930,86 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PrunePlanCache: invalidate "old" cached plans. 
+ */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (plancache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < plancache_prune_min_age) + return; + } + + /* last plan is the oldest. */ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* + * No work if it already doesn't have gplan and move it to the + * beginning so that we don't see it at the next time + */ + if (!plansource->gplan) + continue; + + /* + * Check age for pruning. Can exit immediately when finding a + * not-older element. + */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= plancache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove generic plans of this plansource if it is not actually + * used and move it to the beginning of the list. Just update + * last_access and move it to the beginning if the plan is used.
+ */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansources to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans. */ diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 5a8b15652a..a5b4979662 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -187,6 +187,8 @@ struct HASHHDR int nelem_alloc; /* number of entries to allocate at once */ bool prunable; /* true if prunable */ HASH_PRUNE_CB prune_cb; /* pruning callback. see above.
*/ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ #ifdef HASH_STATISTICS @@ -510,6 +512,14 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) { hctl->prunable = true; hctl->prune_cb = info->prune_cb; + if (info->memory_target) + hctl->memory_target = info->memory_target; + else + hctl->memory_target = &syscache_memory_target; + if (info->prune_min_age) + hctl->prune_min_age = info->prune_min_age; + else + hctl->prune_min_age = &syscache_prune_min_age; } else hctl->prunable = false; @@ -1654,7 +1664,7 @@ prune_entries(HTAB *hashp) has_seq_scans(hashp)); /* This setting prevents pruning */ - if (syscache_prune_min_age < 0) + if (*hctl->prune_min_age < 0) return false; @@ -1663,7 +1673,7 @@ prune_entries(HTAB *hashp) * settings is shared with syscache */ if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize < - (Size) syscache_memory_target * 1024L) + (Size) *hctl->memory_target * 1024L) return false; /* @@ -1683,7 +1693,7 @@ prune_entries(HTAB *hashp) TimestampDifference(helm->last_access, currclock, &entry_age, &us); /* settings is shared with syscache */ - if (entry_age > syscache_prune_min_age) + if (entry_age > *hctl->prune_min_age) { /* Wait for the next chance if this is recently used */ if (helm->naccess > 0) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 5e0d18657f..45aab61d62 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -1995,6 +1995,27 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept on memory."), + gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches this value.
-1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"plancache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum duration of plancache entries to remove."), + gettext_noop("Plancache items that live unused for longer than this many seconds are considered to be removed."), + GUC_UNIT_S + }, + &plancache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index d7baa54808..db225a06da 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -194,6 +194,7 @@ MemoryContextResetChildren(MemoryContext context) * but we have to recurse to handle the children. * We must also delink the context from its parent, if it has one. */ +int hoge = 0; void MemoryContextDelete(MemoryContext context) { diff --git a/src/include/commands/prepare.h b/src/include/commands/prepare.h index ffec029df4..1a8e8dd50e 100644 --- a/src/include/commands/prepare.h +++ b/src/include/commands/prepare.h @@ -31,6 +31,10 @@ typedef struct CachedPlanSource *plansource; /* the actual cached plan */ bool from_sql; /* prepared via SQL, not FE/BE protocol? */ TimestampTz prepare_time; /* the time when the stmt was prepared */ + RawStmt *raw_stmt; + int num_params; + Oid *param_types; + List *query_list; } PreparedStatement; diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index df12352a46..7ea3c75423 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -93,6 +93,8 @@ typedef struct HASHCTL MemoryContext hcxt; /* memory context to use for allocations */ HASHHDR *hctl; /* location of header in shared mem */ HASH_PRUNE_CB prune_cb; /* pruning callback. see above.
*/ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ } HASHCTL; /* Flags to indicate which parameters are supplied */ diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index ab20aa04b0..b5d439985c 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -72,10 +72,11 @@ struct RawStmt; * is no way to free memory short of clearing that entire context. A oneshot * plan is always treated as unsaved. * - * Note: the string referenced by commandTag is not subsidiary storage; - * it is assumed to be a compile-time-constant string. As with portals, - * commandTag shall be NULL if and only if the original query string (before - * rewriting) was an empty string. + * Note: the string referenced by commandTag is not subsidiary storage; it is + * assumed to be a compile-time-constant string. As with portals, commandTag + * shall be NULL if and only if the original query string (before rewriting) + * was an empty string. For memory-saving purpose, this struct is separated + * into two parts; the latter is removable in inactive state. */ typedef struct CachedPlanSource { @@ -110,11 +111,13 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid?
*/ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ + struct CachedPlanSource *prev_saved; /* list link, if so */ struct CachedPlanSource *next_saved; /* list link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +146,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.2
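As a side note for reviewers: the eviction policy that both the catcache and dynahash patches above rely on (a per-entry access counter saturating at 2, decremented once per pruning pass before the entry becomes evictable) can be modeled in isolation. The following is a simplified standalone sketch, not code from the patches; the type and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of the two-step aging policy: each entry keeps an
 * access counter that saturates at 2 and a logical last-access clock.
 * An entry older than prune_min_age is only evicted once its counter
 * has been walked down to zero, so recently re-used entries survive
 * extra pruning passes. */
typedef struct PrunableEntry
{
    int  naccess;       /* saturates at 2, counted up on each lookup */
    long last_access;   /* logical clock of the last lookup */
} PrunableEntry;

/* called on every cache hit or insertion */
static void
entry_touch(PrunableEntry *e, long clock)
{
    if (e->naccess < 2)
        e->naccess++;
    e->last_access = clock;
}

/* one pruning pass over an entry; returns true when it should be removed */
static bool
entry_prune_step(PrunableEntry *e, long clock, long prune_min_age)
{
    if (prune_min_age < 0)
        return false;               /* negative setting disables pruning */
    if (clock - e->last_access <= prune_min_age)
        return false;               /* not old enough yet */
    if (e->naccess > 0)
    {
        e->naccess--;               /* grant one more grace period */
        return false;
    }
    return true;                    /* old and unused: evictable */
}
```

An entry thus survives up to two extra pruning passes after its last access, which approximates a coarse LRU without keeping entries ordered by access time.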
On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> > The attached is the patch set including this plancache stuff.
>
> 0001- catcache time-based expiration (The origin of this thread)
> 0002- introduces dynahash pruning feature
> 0003- implement relcache pruning using 0002
> 0004- (perhaps) independent from the three above. PoC of
>       plancache pruning. Details are shown above.

It looks like this should be marked Needs Review so I have done so. If
that's not right please change it back or let me know and I will.

Regards,
-- 
-David
david@pgmasters.net
Hello.

At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> > > The attached is the patch set including this plancache stuff.
> >
> > 0001- catcache time-based expiration (The origin of this thread)
> > 0002- introduces dynahash pruning feature
> > 0003- implement relcache pruning using 0002
> > 0004- (perhaps) independent from the three above. PoC of
> >       plancache pruning. Details are shown above.
>
> It looks like this should be marked Needs Review so I have done so. If
> that's not right please change it back or let me know and I will.

Mmm. I haven't noticed that. Thanks!

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
> Hello.
>
> At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> > On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> > > > The attached is the patch set including this plancache stuff.
> > >
> > > 0001- catcache time-based expiration (The origin of this thread)
> > > 0002- introduces dynahash pruning feature
> > > 0003- implement relcache pruning using 0002
> > > 0004- (perhaps) independent from the three above. PoC of
> > >       plancache pruning. Details are shown above.
> >
> > It looks like this should be marked Needs Review so I have done so. If
> > that's not right please change it back or let me know and I will.
>
> Mmm. I haven't noticed that. Thanks!

I actually think this should be marked as returned with feedback, or at
the very least moved to the next CF. This is entirely new development
within the last CF. There's no realistic way we can get this into v11.

Greetings,

Andres Freund
At Thu, 29 Mar 2018 18:22:59 -0700, Andres Freund <andres@anarazel.de> wrote in <20180330012259.7k3442yz7jighg2t@alap3.anarazel.de>
> On 2018-03-23 17:01:11 +0900, Kyotaro HORIGUCHI wrote:
> > Hello.
> >
> > At Wed, 21 Mar 2018 15:28:07 -0400, David Steele <david@pgmasters.net> wrote in <43095b16-14fc-e4d8-3310-2b86eaaab662@pgmasters.net>
> > > On 3/15/18 1:12 AM, Kyotaro HORIGUCHI wrote:
> > > > At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
> > > > > The attached is the patch set including this plancache stuff.
> > > >
> > > > 0001- catcache time-based expiration (The origin of this thread)
> > > > 0002- introduces dynahash pruning feature
> > > > 0003- implement relcache pruning using 0002
> > > > 0004- (perhaps) independent from the three above. PoC of
> > > >       plancache pruning. Details are shown above.
> > >
> > > It looks like this should be marked Needs Review so I have done so. If
> > > that's not right please change it back or let me know and I will.
> >
> > Mmm. I haven't noticed that. Thanks!
>
> I actually think this should be marked as returned with feedback, or at
> the very least moved to the next CF. This is entirely new development
> within the last CF. There's no realistic way we can get this into v11.

0002-0004 is new, in response to the comment that caches other
than the catcache ought to get the same feature. These can be a
separate development from 0001 for v12. I don't find a measures
to catch the all case at once.

If we agree on the point. I wish to discuss only 0001 for v11.

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
> 0002-0004 is new, in response to the comment that caches other
> than the catcache ought to get the same feature. These can be a
> separate development from 0001 for v12. I don't find a measures
> to catch the all case at once.
>
> If we agree on the point. I wish to discuss only 0001 for v11.

I'd personally not want to commit a solution for catcaches without also
committing a solution for at least relcaches in the same release cycle.
I think this patch simply has missed the window for v11.

Greetings,

Andres Freund
At Thu, 29 Mar 2018 18:51:45 -0700, Andres Freund <andres@anarazel.de> wrote in <20180330015145.pvsr6kjtf6tw4uwe@alap3.anarazel.de>
> Hi,
>
> On 2018-03-30 10:35:48 +0900, Kyotaro HORIGUCHI wrote:
> > 0002-0004 is new, in response to the comment that caches other
> > than the catcache ought to get the same feature. These can be a
> > separate development from 0001 for v12. I don't find a measures
> > to catch the all case at once.
> >
> > If we agree on the point. I wish to discuss only 0001 for v11.
>
> I'd personally not want to commit a solution for catcaches without also
> committing a solution for at least relcaches in the same release cycle.
> I think this patch simply has missed the window for v11.

Ok. Agreed. I moved this to the next CF.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. I rebased this patchset.

At Thu, 15 Mar 2018 14:12:46 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180315.141246.130742928.horiguchi.kyotaro@lab.ntt.co.jp>
> At Mon, 12 Mar 2018 17:34:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180312.173408.162882093.horiguchi.kyotaro@lab.ntt.co.jp>
> > > > In short, it's not really apparent to me that negative syscache entries
> > > > are the major problem of this kind. I'm afraid that you're drawing very
> > > > large conclusions from a specific workload. Maybe we could fix that
> > > > workload some other way.
> > >
> > > The current patch doesn't consider whether an entry is negative
> > > or positive(?). Just clean up all entries based on time.
> > >
> > > If relation has to have the same characteristics as syscaches, it
> > > might be better be on the catcache mechanism, instead of adding
> > > the same pruning mechanism to dynahash..

This means unifying catcache and dynahash. It doesn't seem a win-win consolidation. In addition, relcache links palloc'ed memory, which needs additional treatment. Or we could abstract the pruning mechanism so that it is applicable to both machineries, specifically by unifying CatCacheCleanupOldEntries in 0001 and prune_entries in 0002. Or we could refactor dynahash and rebuild catcache based on dynahash.

> > For the moment, I added such feature to dynahash and let only
> > relcache use it in this patch. Hash element has different shape
> > in "prunable" hash and pruning is performed in a similar way
> > sharing the setting with syscache. This seems working fine.
>
> I gave consideration on plancache. The most different
> characteristics from catcache and relcache is the fact that it is
> not voluntarily removable since CachedPlanSource, the root struct
> of a plan cache, holds some indispensable information.
In regards > to prepared queries, even if we store the information into > another location, for example in "Prepred Queries" hash, it > merely moving a big data into another place. > > Looking into CachedPlanSoruce, generic plan is a part that is > safely removable since it is rebuilt as necessary. Keeping "old" > plancache entries not holding a generic plan can reduce memory > usage. > > For testing purpose, I made 50000 parepared statement like > "select sum(c) from p where e < $" on 100 partitions, > > With disabling the feature (0004 patch) VSZ of the backend > exceeds 3GB (It is still increasing at the moment), while it > stops to increase at about 997MB for min_cached_plans = 1000 and > plancache_prune_min_age = '10s'. > > # 10s is apparently short for acutual use, of course. > > It is expected to be significant amount if the plan is large > enough but I'm still not sure it is worth doing, or is a right > way. > > > The attached is the patch set including this plancache stuff. > > 0001- catcache time-based expiration (The origin of this thread) > 0002- introduces dynahash pruning feature > 0003- implement relcache pruning using 0002 > 0004- (perhaps) independent from the three above. PoC of > plancache pruning. Details are shown above. I found up to v3 in this thread so I named this version 4. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 842f7b9fd47c6ee4daf1316547679d4298538940 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 15 Mar 2018 12:04:43 +0900 Subject: [PATCH 4/4] Generic plan removal of PlanCacheSource. We cannot remove saved cached plans while pruning since they are pointed from other structures. But still we can remove generic plan of each saved plans. The behavior is controled by two additional GUC variables min_cached_plans and cache_prune_min_age. The former tells to keep that number of generic plans without pruned. 
The latter tells how long we should keep generic plans before pruning. --- src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 10 +++ src/include/utils/plancache.h | 7 +- 3 files changed, 179 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 0ad3e3c736..701ead152c 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,12 @@ * guarantee to save a CachedPlanSource without error. */ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* @@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void
+MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan != plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. */ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + + /* decrement "saved plans" counter */ + if (plansource->is_saved) + { + Assert (num_saved_plans > 0); + num_saved_plans--; + } } } 
@@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* + * Set the last-accessed timestamp and move this plan to the head of the list + */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the head of the list */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved && + min_cached_plans >= 0 && num_saved_plans > min_cached_plans) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + /* count this new saved plan */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PruneCachedPlan: remove the generic plans of "old" saved plans.
+ */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for the oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < cache_prune_min_age) + return; + } + + /* the last plan is the oldest. */ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* we don't want to prune any more plans */ + if (num_saved_plans <= min_cached_plans) + break; + + /* + * Nothing to do if it has no generic plan already + */ + if (!plansource->gplan) + continue; + + /* + * Check the age for pruning. We can exit immediately on finding an + * element that is not older. + */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= cache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove the generic plan of this plansource if it is not + * actually in use. Just update + * last_access if the plan is in use.
+ */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansources altogether to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 9800252965..478bfe96a4 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2128,6 +2128,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept in memory."), + gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches this value. -1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index ab20aa04b0..f3c5b2010d 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,11 +110,13 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid?
*/ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *next_saved; /* list link, if so */ + struct CachedPlanSource *prev_saved; /* list prev link, if so */ + struct CachedPlanSource *next_saved; /* list next link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +145,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.3 From 06bf577b5092a9fa443122bc8eef51284c6aa339 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 17:31:43 +0900 Subject: [PATCH 3/3] Apply pruning to relcache --- src/backend/utils/cache/relcache.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index d85dc92505..dbbf9855b0 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3442,6 +3442,26 @@ RelationSetNewRelfilenode(Relation relation, char persistence, #define INITRELCACHESIZE 400 +/* callback function for hash pruning */ +static bool +relcache_prune_cb(HTAB *hashp, void *ent) +{ + RelIdCacheEnt *relent = (RelIdCacheEnt *) ent; + Relation relation; + + /* this relation is requested to be removed.
*/ + RelationIdCacheLookup(relent->reloid, relation); + + /* don't remove if currently in use */ + if (!RelationHasReferenceCountZero(relation)) + return false; + + /* otherwise we can forget it unconditionally */ + RelationClearRelation(relation, false); + + return true; +} + void RelationCacheInitialize(void) { @@ -3459,8 +3479,11 @@ RelationCacheInitialize(void) MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); + + /* use the same settings as syscache */ + ctl.prune_cb = relcache_prune_cb; RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE, - &ctl, HASH_ELEM | HASH_BLOBS); + &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE); /* * relation mapper needs to be initialized too -- 2.16.3 From 0c0f8ff6e786dc20aa43636cad57f3713c0c89dd Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 15:52:18 +0900 Subject: [PATCH 2/3] introduce dynahash pruning --- src/backend/utils/hash/dynahash.c | 166 +++++++++++++++++++++++++++++++++----- src/include/utils/catcache.h | 12 +++ src/include/utils/hsearch.h | 21 ++++- 3 files changed, 179 insertions(+), 20 deletions(-) diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 785e0faffb..261f8d9577 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -88,6 +88,7 @@ #include "access/xact.h" #include "storage/shmem.h" #include "storage/spin.h" +#include "utils/catcache.h" #include "utils/dynahash.h" #include "utils/memutils.h" @@ -184,6 +185,12 @@ struct HASHHDR long ssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc; /* number of entries to allocate at once */ + bool prunable; /* true if prunable */ + HASH_PRUNE_CB prune_cb; /* function to call instead of just deleting */ + + /* These fields point to variables to control pruning */ + int *memory_target; /* pointer to memory target value in kB */ + 
int *prune_min_age; /* pointer to prune minimum age value in sec */ #ifdef HASH_STATISTICS @@ -227,16 +234,18 @@ struct HTAB int sshift; /* segment shift = log2(ssize) */ }; +#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT)) + /* * Key (also entry) part of a HASHELEMENT */ -#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT))) +#define ELEMENTKEY(helem, ctlp) (((char *)(helem)) + HASHELEMENT_SIZE(ctlp)) /* * Obtain element pointer given pointer to key */ -#define ELEMENT_FROM_KEY(key) \ - ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT)))) +#define ELEMENT_FROM_KEY(key, ctlp) \ + ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp))) /* * Fast MOD arithmetic, assuming that y is a power of 2 ! @@ -257,6 +266,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp); static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx); static bool dir_realloc(HTAB *hashp); static bool expand_table(HTAB *hashp); +static bool prune_entries(HTAB *hashp); static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx); static void hdefault(HTAB *hashp); static int choose_nelem_alloc(Size entrysize); @@ -499,6 +509,29 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->entrysize = info->entrysize; } + /* + * Set up pruning. + * + * We have two knobs to control pruning, and a hash can share them with + * syscache.
+ * + */ + if (flags & HASH_PRUNABLE) + { + hctl->prunable = true; + hctl->prune_cb = info->prune_cb; + if (info->memory_target) + hctl->memory_target = info->memory_target; + else + hctl->memory_target = &cache_memory_target; + if (info->prune_min_age) + hctl->prune_min_age = info->prune_min_age; + else + hctl->prune_min_age = &cache_prune_min_age; + } + else + hctl->prunable = false; + /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; hashp->ssize = hctl->ssize; @@ -984,7 +1017,7 @@ hash_search_with_hash_value(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == hashvalue && - match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -997,6 +1030,17 @@ hash_search_with_hash_value(HTAB *hashp, if (foundPtr) *foundPtr = (bool) (currBucket != NULL); + /* Update access counter if needed */ + if (hctl->prunable && currBucket && + (action == HASH_FIND || action == HASH_ENTER)) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * OK, now what? 
*/ @@ -1004,7 +1048,8 @@ hash_search_with_hash_value(HTAB *hashp, { case HASH_FIND: if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); + return NULL; case HASH_REMOVE: @@ -1033,7 +1078,7 @@ hash_search_with_hash_value(HTAB *hashp, * element, because someone else is going to reuse it the next * time something is added to the table */ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } return NULL; @@ -1045,7 +1090,7 @@ hash_search_with_hash_value(HTAB *hashp, case HASH_ENTER: /* Return existing element if found, else create one */ if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); /* disallow inserts if frozen */ if (hashp->frozen) @@ -1075,8 +1120,18 @@ hash_search_with_hash_value(HTAB *hashp, /* copy key into record */ currBucket->hashvalue = hashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize); + /* set access counter */ + if (hctl->prunable) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * Caller is expected to fill the data field on return. DO NOT * insert any code that could possibly throw error here, as doing @@ -1084,7 +1139,7 @@ hash_search_with_hash_value(HTAB *hashp, * caller's data structure. 
*/ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } elog(ERROR, "unrecognized hash action code: %d", (int) action); @@ -1116,7 +1171,7 @@ hash_update_hash_key(HTAB *hashp, void *existingEntry, const void *newKeyPtr) { - HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry); + HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl); HASHHDR *hctl = hashp->hctl; uint32 newhashvalue; Size keysize; @@ -1200,7 +1255,7 @@ hash_update_hash_key(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == newhashvalue && - match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -1234,7 +1289,7 @@ hash_update_hash_key(HTAB *hashp, /* copy new key into record */ currBucket->hashvalue = newhashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize); /* rest of record is untouched */ @@ -1388,8 +1443,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) void * hash_seq_search(HASH_SEQ_STATUS *status) { - HTAB *hashp; - HASHHDR *hctl; + HTAB *hashp = status->hashp; + HASHHDR *hctl = hashp->hctl; uint32 max_bucket; long ssize; long segment_num; @@ -1404,15 +1459,13 @@ hash_seq_search(HASH_SEQ_STATUS *status) status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ ++status->curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } /* * Search for next nonempty bucket starting at curBucket. 
*/ curBucket = status->curBucket; - hashp = status->hashp; - hctl = hashp->hctl; ssize = hashp->ssize; max_bucket = hctl->max_bucket; @@ -1458,7 +1511,7 @@ hash_seq_search(HASH_SEQ_STATUS *status) if (status->curEntry == NULL) /* end of this bucket */ ++curBucket; status->curBucket = curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } void @@ -1552,6 +1605,10 @@ expand_table(HTAB *hashp) */ if ((uint32) new_bucket > hctl->high_mask) { + /* Try pruning before expansion; return true on success */ + if (hctl->prunable && prune_entries(hashp)) + return true; + hctl->low_mask = hctl->high_mask; hctl->high_mask = (uint32) new_bucket | hctl->low_mask; } @@ -1594,6 +1651,77 @@ expand_table(HTAB *hashp) return true; } +static bool +prune_entries(HTAB *hashp) +{ + HASHHDR *hctl = hashp->hctl; + HASH_SEQ_STATUS status; + void *elm; + TimestampTz currclock = GetCatCacheClock(); + int nall = 0, + nremoved = 0; + + Assert(hctl->prunable); + + /* Return if pruning is currently disabled or not doable */ + if (*hctl->prune_min_age < 0 || hashp->frozen || has_seq_scans(hashp)) + return false; + + /* + * We don't prune before reaching this size. We consider only the bucket + * array size since it is the dominant part of the memory usage. + */ + if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize < + (Size) *hctl->memory_target * 1024L) + return false; + + /* OK, start pruning. We can use a seq scan here.
*/ + hash_seq_init(&status, hashp); + while ((elm = hash_seq_search(&status)) != NULL) + { + PRUNABLE_HASHELEMENT *helm = + (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl); + long entry_age; + int us; + + nall++; + + TimestampDifference(helm->last_access, currclock, &entry_age, &us); + + /* + * consider pruning if this entry has not been accessed for a certain + * time + */ + if (entry_age > *hctl->prune_min_age) + { + /* Wait for the next chance if this is recently used */ + if (helm->naccess > 0) + helm->naccess--; + else + { + /* just call it if callback is provided, remove otherwise */ + if (hctl->prune_cb) + { + if (hctl->prune_cb(hashp, (void *)elm)) + nremoved++; + } + else + { + bool found; + + hash_search(hashp, elm, HASH_REMOVE, &found); + Assert(found); + nremoved++; + } + } + } + } + + elog(DEBUG1, "removed %d/%d entries from hash \"%s\"", + nremoved, nall, hashp->tabname); + + return nremoved > 0; +} static bool dir_realloc(HTAB *hashp) @@ -1667,7 +1795,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx) return false; /* Each element has a HASHELEMENT header plus user data. */ - elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize); + elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize); CurrentDynaHashCxt = hashp->hcxt; firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 599303be56..b3f73f53d2 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts) catcacheclock = ts; } +/* + * GetCatCacheClock - get timestamp for catcache access record + * + * This clock is basically provided for catcache usage, but dynahash has a + * similar pruning mechanism and wants to use the same clock. 
+ */ +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index 8357faac5a..6e9fa74a4f 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -13,7 +13,7 @@ */ #ifndef HSEARCH_H #define HSEARCH_H - +#include "datatype/timestamp.h" /* * Hash functions must have this signature. @@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request); * HASHELEMENT is the private part of a hashtable entry. The caller's data * follows the HASHELEMENT structure (on a MAXALIGN'd boundary). The hash key * is expected to be at the start of the caller's hash entry data structure. + * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead. */ typedef struct HASHELEMENT { @@ -54,12 +55,26 @@ typedef struct HASHELEMENT uint32 hashvalue; /* hash function result for this entry */ } HASHELEMENT; +typedef struct PRUNABLE_HASHELEMENT +{ + struct HASHELEMENT *link; /* link to next entry in same bucket */ + uint32 hashvalue; /* hash function result for this entry */ + TimestampTz last_access; /* timestamp of last usage */ + int naccess; /* takes 0 to 2, counted up when used */ +} PRUNABLE_HASHELEMENT; + /* Hash table header struct is an opaque type known only within dynahash.c */ typedef struct HASHHDR HASHHDR; /* Hash table control struct is an opaque type known only within dynahash.c */ typedef struct HTAB HTAB; +/* + * Hash pruning callback, which is called for an entry that is about to be + * pruned and returns false if the entry should be kept.
+ */ +typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent); + /* Parameter data structure for hash_create */ /* Only those fields indicated by hash_flags need be set */ typedef struct HASHCTL @@ -77,6 +92,9 @@ typedef struct HASHCTL HashAllocFunc alloc; /* memory allocator */ MemoryContext hcxt; /* memory context to use for allocations */ HASHHDR *hctl; /* location of header in shared mem */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ } HASHCTL; /* Flags to indicate which parameters are supplied */ @@ -94,6 +112,7 @@ typedef struct HASHCTL #define HASH_SHARED_MEM 0x0800 /* Hashtable is in shared memory */ #define HASH_ATTACH 0x1000 /* Do not initialize hctl */ #define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */ +#define HASH_PRUNABLE 0x4000 /* pruning setting */ /* max_dsize value to indicate expansible directory */ -- 2.16.3 From 870ca3f1403310493b2580314c8b1b478dbff028 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 26 Dec 2017 17:43:09 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time before enlarging the hash array.
--- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 153 +++++++++++++++++++++++- src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 33 ++++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 19 +++ src/include/utils/plancache.h | 7 +- 8 files changed, 413 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7bfbc87109..4ba4327007 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1617,6 +1617,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which the syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a + certain amount of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds after which a + syscache entry is considered for removal. -1 indicates that syscache + pruning is disabled entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>).
The syscache entries that have not been + used for that duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of the syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 8e6aef332c..e4a4a5874c 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -732,6 +732,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..9f421cd242 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age in seconds of entries that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +881,130 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if we haven't done so yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent the catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the bucket array dominates the memory usage, we consider only it.
+ */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that are not accessed after the last pruning are removed + * after that many seconds, and entries that have been accessed + * several times are removed after being left alone for up to three + * times that duration. We don't try to shrink the buckets since + * pruning effectively caps catcache expansion in the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1418,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1813,7 +1954,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2046,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2055,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. 
Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 0ad3e3c736..701ead152c 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,12 @@ * guarantee to save a CachedPlanSource without error.
*/ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* @@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void +MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan != plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. 
*/ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + + /* decrement "saved plans" counter */ + if (plansource->is_saved) + { + Assert (num_saved_plans > 0); + num_saved_plans--; + } } } @@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* + * set last-accessed timestamp and move this plan to the first of the list + */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the first of the list */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved 
&& + min_cached_plans >= 0 && num_saved_plans > min_cached_plans) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + /* count this new saved plan */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PrunePlanCache: removes generic plan of "old" saved plans. + */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < cache_prune_min_age) + return; + } + + /* last plan is the oldest. */ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* we want to prune no more plans */ + if (num_saved_plans <= min_cached_plans) + break; + + /* + * No work if it already doesn't have gplan and move it to the + * beginning so that we don't see it at the next time + */ + if (!plansource->gplan) + continue; + + /* + * Check age for pruning. Can exit immediately when finding a + * not-older element. 
+ */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= cache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove generic plans of this plansrouceif it is not actually + * used and move it to the beginning of the list. Just update + * last_access and move it to the beginning if the plan is used. + */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansrouces altogehter to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans. 
*/ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 859ef931e7..774a87ed2c 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -79,6 +79,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -2105,6 +2106,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept on memory."), + gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches thisvalue. -1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9e39baf466..3f2760ef9d 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -126,6 +126,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..599303be56 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLPMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index ab20aa04b0..f3c5b2010d 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,11 +110,13 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid? */ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *next_saved; /* list link, if so */ + struct CachedPlanSource *prev_saved; /* list prev link, if so */ + struct CachedPlanSource *next_saved; /* list next link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +145,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.3
On 06/26/2018 05:00 AM, Kyotaro HORIGUCHI wrote: > >> The attached is the patch set including this plancache stuff. >> >> 0001- catcache time-based expiration (The origin of this thread) >> 0002- introduces dynahash pruning feature >> 0003- implement relcache pruning using 0002 >> 0004- (perhaps) independent from the three above. PoC of >> plancache pruning. Details are shown above. > I found up to v3 in this thread so I named this version 4. > Andres suggested back in March (and again privately to me) that given how much this has changed from the original this CF item should be marked Returned With Feedback and the current patchset submitted as a new item. Does anyone object to that course of action? cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello. The previous v4 patchset was just broken. At Tue, 26 Jun 2018 18:00:03 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180626.180003.127457941.horiguchi.kyotaro@lab.ntt.co.jp> > Hello. I rebased this patchset. .. > > The attached is the patch set including this plancache stuff. > > > > 0001- catcache time-based expiration (The origin of this thread) > > 0002- introduces dynahash pruning feature > > 0003- implement relcache pruning using 0002 > > 0004- (perhaps) independent from the three above. PoC of > > plancache pruning. Details are shown above. > > I found up to v3 in this thread so I named this version 4. Somehow 0004 was merged into 0003, so applying 0004 results in failure. I removed the 0004 part from 0003, rebased, and am reposting it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From e267985853b100a8ecfd10cc02f464f8c802d19e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 26 Dec 2017 17:43:09 +0900 Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time before enlarging the hash array.
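The aging scheme the commit message describes — a small per-entry access counter combined with a last-access timestamp, consulted before the hash is enlarged — can be sketched in isolation as follows. This is a standalone illustration with hypothetical names (`SketchEntry`, `sketch_touch`, `sketch_prune_one`), not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, simplified view of a catcache entry: the patch tracks a
 * small access counter (saturating at 2) and the clock value of the last
 * access on each CatCTup.
 */
typedef struct SketchEntry
{
	int		naccess;	/* bumped on each hit, saturating at 2 */
	long	lastaccess; /* clock value (seconds) of the last access */
} SketchEntry;

/* On a cache hit, the counter saturates at 2 and the timestamp advances. */
static void
sketch_touch(SketchEntry *e, long clock_now)
{
	if (e->naccess < 2)
		e->naccess++;
	e->lastaccess = clock_now;
}

/*
 * One pruning decision for an entry, mirroring the patch's rule: entries
 * older than min_age first have their counter decremented; only entries
 * whose counter has already reached zero are actually removed.  Returns
 * true when the entry should be evicted now.
 */
static bool
sketch_prune_one(SketchEntry *e, long clock_now, long min_age)
{
	long	entry_age = clock_now - e->lastaccess;

	if (entry_age <= min_age)
		return false;			/* too recently used to touch */
	if (e->naccess > 0)
	{
		e->naccess--;			/* spare it this pass, clock-sweep style */
		return false;
	}
	return true;				/* old and out of "lives": evict */
}
```

An entry touched once thus survives one pruning pass beyond `min_age` and is evicted on the next, which is why frequently accessed entries can live up to a small multiple of `cache_prune_min_age` longer than untouched ones.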
--- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 153 +++++++++++++++++++++++- src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 33 ++++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 19 +++ src/include/utils/plancache.h | 7 +- 8 files changed, 413 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 5b913f00c1..76745047af 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1617,6 +1617,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which the syscache may grow + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a + certain amount of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds after which a + syscache entry is considered for removal. -1 indicates that syscache + pruning is disabled entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>).
The syscache entries that are not + used for that duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of the syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 8e6aef332c..e4a4a5874c 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -732,6 +732,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..9f421cd242 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of the hash at which to consider + * entry eviction. This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age, in seconds, of entries that will + * be considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -866,9 +881,130 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent the catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + * Since the area for the bucket array is dominant, consider only it.
+ */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that have not been accessed since the last pruning are + * removed after that interval, while entries that have been + * accessed several times are removed only after being left unused + * for up to three times that duration. We don't try to shrink + * buckets since pruning effectively caps catcache expansion in + * the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1418,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1813,7 +1954,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1906,6 +2046,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1913,10 +2055,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. 
Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 0ad3e3c736..701ead152c 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,12 @@ * guarantee to save a CachedPlanSource without error.
*/ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* @@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void +MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan != plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. 
*/ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + + /* decrement "saved plans" counter */ + if (plansource->is_saved) + { + Assert (num_saved_plans > 0); + num_saved_plans--; + } } } @@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* + * set last-accessed timestamp and move this plan to the first of the list + */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the first of the list */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved 
&& + min_cached_plans >= 0 && num_saved_plans > min_cached_plans) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + /* count this new saved plan */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PruneCachedPlan: removes the generic plans of "old" saved plans. + */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for the oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < cache_prune_min_age) + return; + } + + /* The last plan is the oldest. */ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* we don't want to prune any more plans */ + if (num_saved_plans <= min_cached_plans) + break; + + /* + * No work if it no longer has a gplan; it is moved to the beginning + * along with the other scanned entries so that we don't visit it + * next time + */ + if (!plansource->gplan) + continue; + + /* + * Check age for pruning. Can exit immediately when finding a + * not-older element.
+ */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= cache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove the generic plan of this plansource if it is not + * actually in use, and move it to the beginning of the list. If the + * plan is in use, just update last_access and move it to the + * beginning. + */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansources altogether to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans.
*/ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index b05fb209bb..e49346707d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -79,6 +79,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/memutils.h" #include "utils/pg_locale.h" @@ -2105,6 +2106,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that stay unused for longer than this many seconds are considered for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept in memory."), + gettext_noop("Timeout invalidation of the plancache is not activated until the number of plancaches reaches this value. -1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9e39baf466..3f2760ef9d 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -126,6 +126,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..599303be56 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index ab20aa04b0..f3c5b2010d 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,11 +110,13 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid? */ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *next_saved; /* list link, if so */ + struct CachedPlanSource *prev_saved; /* list prev link, if so */ + struct CachedPlanSource *next_saved; /* list next link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +145,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.3 From 8994c6d038b72ff253ad24dda0f0da99e6916b05 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 15:52:18 +0900 Subject: [PATCH 2/4] introduce dynahash pruning ---
src/backend/utils/hash/dynahash.c | 166 +++++++++++++++++++++++++++++++++----- src/include/utils/catcache.h | 12 +++ src/include/utils/hsearch.h | 21 ++++- 3 files changed, 179 insertions(+), 20 deletions(-) diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 785e0faffb..261f8d9577 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -88,6 +88,7 @@ #include "access/xact.h" #include "storage/shmem.h" #include "storage/spin.h" +#include "utils/catcache.h" #include "utils/dynahash.h" #include "utils/memutils.h" @@ -184,6 +185,12 @@ struct HASHHDR long ssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc; /* number of entries to allocate at once */ + bool prunable; /* true if prunable */ + HASH_PRUNE_CB prune_cb; /* function to call instead of just deleting */ + + /* These fields point to variables to control pruning */ + int *memory_target; /* pointer to memory target value in kB */ + int *prune_min_age; /* pointer to prune minimum age value in sec */ #ifdef HASH_STATISTICS @@ -227,16 +234,18 @@ struct HTAB int sshift; /* segment shift = log2(ssize) */ }; +#define HASHELEMENT_SIZE(ctlp) MAXALIGN(ctlp->prunable ? sizeof(PRUNABLE_HASHELEMENT) : sizeof(HASHELEMENT)) + /* * Key (also entry) part of a HASHELEMENT */ -#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT))) +#define ELEMENTKEY(helem, ctlp) (((char *)(helem)) + HASHELEMENT_SIZE(ctlp)) /* * Obtain element pointer given pointer to key */ -#define ELEMENT_FROM_KEY(key) \ - ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT)))) +#define ELEMENT_FROM_KEY(key, ctlp) \ + ((HASHELEMENT *) (((char *) (key)) - HASHELEMENT_SIZE(ctlp))) /* * Fast MOD arithmetic, assuming that y is a power of 2 ! 
@@ -257,6 +266,7 @@ static HASHSEGMENT seg_alloc(HTAB *hashp); static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx); static bool dir_realloc(HTAB *hashp); static bool expand_table(HTAB *hashp); +static bool prune_entries(HTAB *hashp); static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx); static void hdefault(HTAB *hashp); static int choose_nelem_alloc(Size entrysize); @@ -499,6 +509,29 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags) hctl->entrysize = info->entrysize; } + /* + * Set up pruning. + * + * We have two knobs to control pruning, and a hash can share those of + * the syscache. + * + */ + if (flags & HASH_PRUNABLE) + { + hctl->prunable = true; + hctl->prune_cb = info->prune_cb; + if (info->memory_target) + hctl->memory_target = info->memory_target; + else + hctl->memory_target = &cache_memory_target; + if (info->prune_min_age) + hctl->prune_min_age = info->prune_min_age; + else + hctl->prune_min_age = &cache_prune_min_age; + } + else + hctl->prunable = false; + /* make local copies of heavily-used constant fields */ hashp->keysize = hctl->keysize; hashp->ssize = hctl->ssize; @@ -984,7 +1017,7 @@ hash_search_with_hash_value(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == hashvalue && - match(ELEMENTKEY(currBucket), keyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), keyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -997,6 +1030,17 @@ hash_search_with_hash_value(HTAB *hashp, if (foundPtr) *foundPtr = (bool) (currBucket != NULL); + /* Update access counter if needed */ + if (hctl->prunable && currBucket && + (action == HASH_FIND || action == HASH_ENTER)) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * OK, now what?
*/ @@ -1004,7 +1048,8 @@ hash_search_with_hash_value(HTAB *hashp, { case HASH_FIND: if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); + return NULL; case HASH_REMOVE: @@ -1033,7 +1078,7 @@ hash_search_with_hash_value(HTAB *hashp, * element, because someone else is going to reuse it the next * time something is added to the table */ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } return NULL; @@ -1045,7 +1090,7 @@ hash_search_with_hash_value(HTAB *hashp, case HASH_ENTER: /* Return existing element if found, else create one */ if (currBucket != NULL) - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); /* disallow inserts if frozen */ if (hashp->frozen) @@ -1075,8 +1120,18 @@ hash_search_with_hash_value(HTAB *hashp, /* copy key into record */ currBucket->hashvalue = hashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), keyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), keyPtr, keysize); + /* set access counter */ + if (hctl->prunable) + { + PRUNABLE_HASHELEMENT *prunable_elm = + (PRUNABLE_HASHELEMENT *) currBucket; + if (prunable_elm->naccess < 2) + prunable_elm->naccess++; + prunable_elm->last_access = GetCatCacheClock(); + } + /* * Caller is expected to fill the data field on return. DO NOT * insert any code that could possibly throw error here, as doing @@ -1084,7 +1139,7 @@ hash_search_with_hash_value(HTAB *hashp, * caller's data structure. 
*/ - return (void *) ELEMENTKEY(currBucket); + return (void *) ELEMENTKEY(currBucket, hctl); } elog(ERROR, "unrecognized hash action code: %d", (int) action); @@ -1116,7 +1171,7 @@ hash_update_hash_key(HTAB *hashp, void *existingEntry, const void *newKeyPtr) { - HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry); + HASHELEMENT *existingElement = ELEMENT_FROM_KEY(existingEntry, hashp->hctl); HASHHDR *hctl = hashp->hctl; uint32 newhashvalue; Size keysize; @@ -1200,7 +1255,7 @@ hash_update_hash_key(HTAB *hashp, while (currBucket != NULL) { if (currBucket->hashvalue == newhashvalue && - match(ELEMENTKEY(currBucket), newKeyPtr, keysize) == 0) + match(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize) == 0) break; prevBucketPtr = &(currBucket->link); currBucket = *prevBucketPtr; @@ -1234,7 +1289,7 @@ hash_update_hash_key(HTAB *hashp, /* copy new key into record */ currBucket->hashvalue = newhashvalue; - hashp->keycopy(ELEMENTKEY(currBucket), newKeyPtr, keysize); + hashp->keycopy(ELEMENTKEY(currBucket, hctl), newKeyPtr, keysize); /* rest of record is untouched */ @@ -1388,8 +1443,8 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) void * hash_seq_search(HASH_SEQ_STATUS *status) { - HTAB *hashp; - HASHHDR *hctl; + HTAB *hashp = status->hashp; + HASHHDR *hctl = hashp->hctl; uint32 max_bucket; long ssize; long segment_num; @@ -1404,15 +1459,13 @@ hash_seq_search(HASH_SEQ_STATUS *status) status->curEntry = curElem->link; if (status->curEntry == NULL) /* end of this bucket */ ++status->curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } /* * Search for next nonempty bucket starting at curBucket. 
*/ curBucket = status->curBucket; - hashp = status->hashp; - hctl = hashp->hctl; ssize = hashp->ssize; max_bucket = hctl->max_bucket; @@ -1458,7 +1511,7 @@ hash_seq_search(HASH_SEQ_STATUS *status) if (status->curEntry == NULL) /* end of this bucket */ ++curBucket; status->curBucket = curBucket; - return (void *) ELEMENTKEY(curElem); + return (void *) ELEMENTKEY(curElem, hctl); } void @@ -1552,6 +1605,10 @@ expand_table(HTAB *hashp) */ if ((uint32) new_bucket > hctl->high_mask) { + /* try pruning before expansion. return true on success */ + if (hctl->prunable && prune_entries(hashp)) + return true; + hctl->low_mask = hctl->high_mask; hctl->high_mask = (uint32) new_bucket | hctl->low_mask; } @@ -1594,6 +1651,77 @@ expand_table(HTAB *hashp) return true; } +static bool +prune_entries(HTAB *hashp) +{ + HASHHDR *hctl = hashp->hctl; + HASH_SEQ_STATUS status; + void *elm; + TimestampTz currclock = GetCatCacheClock(); + int nall = 0, + nremoved = 0; + + Assert(hctl->prunable); + + /* Return if pruning is currently disabled or not doable */ + if (*hctl->prune_min_age < 0 || hashp->frozen || has_seq_scans(hashp)) + return false; + + /* + * we don't prune before reaching this size. We only consider bucket array + * size since it is the significant part of memory usage. + */ + if (hctl->dsize * sizeof(HASHBUCKET) * hashp->ssize < + (Size) *hctl->memory_target * 1024L) + return false; + + /* Ok, start pruning. we can use seq scan here. 
*/ + hash_seq_init(&status, hashp); + while ((elm = hash_seq_search(&status)) != NULL) + { + PRUNABLE_HASHELEMENT *helm = + (PRUNABLE_HASHELEMENT *)ELEMENT_FROM_KEY(elm, hctl); + long entry_age; + int us; + + nall++; + + TimestampDifference(helm->last_access, currclock, &entry_age, &us); + + /* + * consider pruning if this entry has not been accessed for a certain + * time + */ + if (entry_age > *hctl->prune_min_age) + { + /* Wait for the next chance if this is recently used */ + if (helm->naccess > 0) + helm->naccess--; + else + { + /* just call it if callback is provided, remove otherwise */ + if (hctl->prune_cb) + { + if (hctl->prune_cb(hashp, (void *)elm)) + nremoved++; + } + else + { + bool found; + + hash_search(hashp, elm, HASH_REMOVE, &found); + Assert(found); + nremoved++; + } + } + } + } + + elog(DEBUG1, "removed %d/%d entries from hash \"%s\"", + nremoved, nall, hashp->tabname); + + return nremoved > 0; +} static bool dir_realloc(HTAB *hashp) @@ -1667,7 +1795,7 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx) return false; /* Each element has a HASHELEMENT header plus user data. */ - elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize); + elementSize = HASHELEMENT_SIZE(hctl) + MAXALIGN(hctl->entrysize); CurrentDynaHashCxt = hashp->hcxt; firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 599303be56..b3f73f53d2 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -208,6 +208,18 @@ SetCatCacheClock(TimestampTz ts) catcacheclock = ts; } +/* + * GetCatCacheClock - get timestamp for catcache access record + * + * This clock is basically provided for catcache usage, but dynahash has a + * similar pruning mechanism and wants to use the same clock. 
+ */ +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h index 8357faac5a..6e9fa74a4f 100644 --- a/src/include/utils/hsearch.h +++ b/src/include/utils/hsearch.h @@ -13,7 +13,7 @@ */ #ifndef HSEARCH_H #define HSEARCH_H - +#include "datatype/timestamp.h" /* * Hash functions must have this signature. @@ -47,6 +47,7 @@ typedef void *(*HashAllocFunc) (Size request); * HASHELEMENT is the private part of a hashtable entry. The caller's data * follows the HASHELEMENT structure (on a MAXALIGN'd boundary). The hash key * is expected to be at the start of the caller's hash entry data structure. + * If this hash is prunable, PRUNABLE_HASHELEMENT is used instead. */ typedef struct HASHELEMENT { @@ -54,12 +55,26 @@ typedef struct HASHELEMENT uint32 hashvalue; /* hash function result for this entry */ } HASHELEMENT; +typedef struct PRUNABLE_HASHELEMENT +{ + struct HASHELEMENT *link; /* link to next entry in same bucket */ + uint32 hashvalue; /* hash function result for this entry */ + TimestampTz last_access; /* timestamp of last usage */ + int naccess; /* takes 0 to 2, counted up when used */ +} PRUNABLE_HASHELEMENT; + /* Hash table header struct is an opaque type known only within dynahash.c */ typedef struct HASHHDR HASHHDR; /* Hash table control struct is an opaque type known only within dynahash.c */ typedef struct HTAB HTAB; +/* + * Hash pruning callback, called for an entry that is about to be + * pruned; it returns false if the entry should be kept.
+ */ +typedef bool (*HASH_PRUNE_CB)(HTAB *hashp, void *ent); + /* Parameter data structure for hash_create */ /* Only those fields indicated by hash_flags need be set */ typedef struct HASHCTL @@ -77,6 +92,9 @@ typedef struct HASHCTL HashAllocFunc alloc; /* memory allocator */ MemoryContext hcxt; /* memory context to use for allocations */ HASHHDR *hctl; /* location of header in shared mem */ + HASH_PRUNE_CB prune_cb; /* pruning callback. see above. */ + int *memory_target; /* pointer to memory target */ + int *prune_min_age; /* pointer to prune minimum age */ } HASHCTL; /* Flags to indicate which parameters are supplied */ @@ -94,6 +112,7 @@ typedef struct HASHCTL #define HASH_SHARED_MEM 0x0800 /* Hashtable is in shared memory */ #define HASH_ATTACH 0x1000 /* Do not initialize hctl */ #define HASH_FIXED_SIZE 0x2000 /* Initial size is a hard limit */ +#define HASH_PRUNABLE 0x4000 /* pruning setting */ /* max_dsize value to indicate expansible directory */ -- 2.16.3 From d2367a23911ff9d231dab80ec22108950bb3f9fc Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 12 Mar 2018 17:31:43 +0900 Subject: [PATCH 3/4] Apply pruning to relcache Implement relcache invalidation.
--- src/backend/utils/cache/plancache.c | 163 ------------------------------------ src/backend/utils/cache/relcache.c | 25 +++++- src/backend/utils/misc/guc.c | 10 --- src/include/utils/plancache.h | 7 +- 4 files changed, 25 insertions(+), 180 deletions(-) diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 701ead152c..0ad3e3c736 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,14 +63,12 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" -#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" -#include "utils/timestamp.h" /* @@ -88,12 +86,6 @@ * guarantee to save a CachedPlanSource without error. */ static CachedPlanSource *first_saved_plan = NULL; -static CachedPlanSource *last_saved_plan = NULL; -static int num_saved_plans = 0; -static TimestampTz oldest_saved_plan = 0; - -/* GUC variables */ -int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -113,7 +105,6 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); -static void PruneCachedPlan(void); /* @@ -217,8 +208,6 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; - plansource->last_access = GetCatCacheClock(); - MemoryContextSwitchTo(oldcxt); @@ -434,28 +423,6 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } -/* moves the plansource to the first in the list */ -static inline void 
-MovePlansourceToFirst(CachedPlanSource *plansource) -{ - if (first_saved_plan != plansource) - { - /* delink this element */ - if (plansource->next_saved) - plansource->next_saved->prev_saved = plansource->prev_saved; - if (plansource->prev_saved) - plansource->prev_saved->next_saved = plansource->next_saved; - if (last_saved_plan == plansource) - last_saved_plan = plansource->prev_saved; - - /* insert at the beginning */ - first_saved_plan->prev_saved = plansource; - plansource->next_saved = first_saved_plan; - plansource->prev_saved = NULL; - first_saved_plan = plansource; - } -} - /* * SaveCachedPlan: save a cached plan permanently * @@ -503,11 +470,6 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. */ plansource->next_saved = first_saved_plan; - if (first_saved_plan) - first_saved_plan->prev_saved = plansource; - else - last_saved_plan = plansource; - plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -530,11 +492,7 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) - { first_saved_plan = plansource->next_saved; - if (first_saved_plan) - first_saved_plan->prev_saved = NULL; - } else { CachedPlanSource *psrc; @@ -544,19 +502,10 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; - if (psrc->next_saved) - psrc->next_saved->prev_saved = psrc; break; } } } - - if (last_saved_plan == plansource) - { - last_saved_plan = plansource->prev_saved; - if (last_saved_plan) - last_saved_plan->next_saved = NULL; - } plansource->is_saved = false; } @@ -588,13 +537,6 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); - - /* decrement "saved plans" counter */ - if (plansource->is_saved) - { - Assert (num_saved_plans > 0); - num_saved_plans--; - } } } 
@@ -1206,17 +1148,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); - /* - * set last-accessed timestamp and move this plan to the first of the list - */ - if (plansource->is_saved) - { - plansource->last_access = GetCatCacheClock(); - - /* move this plan to the first of the list */ - MovePlansourceToFirst(plansource); - } - /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1225,11 +1156,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { - /* Prune cached plans if needed */ - if (plansource->is_saved && - min_cached_plans >= 0 && num_saved_plans > min_cached_plans) - PruneCachedPlan(); - if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1242,11 +1168,6 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); - - /* count this new saved plan */ - if (plansource->is_saved) - num_saved_plans++; - /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1935,90 +1856,6 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } -/* - * PrunePlanCache: removes generic plan of "old" saved plans. 
- */ -static void -PruneCachedPlan(void) -{ - CachedPlanSource *plansource; - TimestampTz currclock = GetCatCacheClock(); - long age; - int us; - int nremoved = 0; - - /* do nothing if not wanted */ - if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) - return; - - /* Fast check for oldest cache */ - if (oldest_saved_plan > 0) - { - TimestampDifference(oldest_saved_plan, currclock, &age, &us); - if (age < cache_prune_min_age) - return; - } - - /* last plan is the oldest. */ - for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) - { - long plan_age; - int us; - - Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); - - /* we want to prune no more plans */ - if (num_saved_plans <= min_cached_plans) - break; - - /* - * No work if it already doesn't have gplan and move it to the - * beginning so that we don't see it at the next time - */ - if (!plansource->gplan) - continue; - - /* - * Check age for pruning. Can exit immediately when finding a - * not-older element. - */ - TimestampDifference(plansource->last_access, currclock, &plan_age, &us); - if (plan_age <= cache_prune_min_age) - { - /* this entry is the next oldest */ - oldest_saved_plan = plansource->last_access; - break; - } - - /* - * Here, remove generic plans of this plansrouceif it is not actually - * used and move it to the beginning of the list. Just update - * last_access and move it to the beginning if the plan is used. 
- */ - if (plansource->gplan->refcount <= 1) - { - ReleaseGenericPlan(plansource); - nremoved++; - } - - plansource->last_access = currclock; - } - - /* move the "removed" plansrouces altogehter to the beginning of the list */ - if (plansource != last_saved_plan && plansource) - { - plansource->next_saved->prev_saved = NULL; - first_saved_plan->prev_saved = last_saved_plan; - last_saved_plan->next_saved = first_saved_plan; - first_saved_plan = plansource->next_saved; - plansource->next_saved = NULL; - last_saved_plan = plansource; - } - - if (nremoved > 0) - elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); -} - /* * ResetPlanCache: invalidate all cached plans. */ diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 6125421d39..19502978cc 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -3442,6 +3442,26 @@ RelationSetNewRelfilenode(Relation relation, char persistence, #define INITRELCACHESIZE 400 +/* callback function for hash pruning */ +static bool +relcache_prune_cb(HTAB *hashp, void *ent) +{ + RelIdCacheEnt *relent = (RelIdCacheEnt *) ent; + Relation relation; + + /* this relation is requested to be removed. 
*/ + RelationIdCacheLookup(relent->reloid, relation); + + /* don't remove if currently in use */ + if (!RelationHasReferenceCountZero(relation)) + return false; + + /* otherwise we can forget it unconditionally */ + RelationClearRelation(relation, false); + + return true; +} + void RelationCacheInitialize(void) { @@ -3459,8 +3479,11 @@ RelationCacheInitialize(void) MemSet(&ctl, 0, sizeof(ctl)); ctl.keysize = sizeof(Oid); ctl.entrysize = sizeof(RelIdCacheEnt); + + /* use the same setting with syscache */ + ctl.prune_cb = relcache_prune_cb; RelationIdCache = hash_create("Relcache by OID", INITRELCACHESIZE, - &ctl, HASH_ELEM | HASH_BLOBS); + &ctl, HASH_ELEM | HASH_BLOBS | HASH_PRUNABLE); /* * relation mapper needs to be initialized too diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index e49346707d..d89654cf8a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2128,16 +2128,6 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, - { - {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, - gettext_noop("Sets the minimum number of cached plans kept on memory."), - gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches thisvalue. -1 means timeout invalidation is always active.") - }, - &min_cached_plans, - 1000, -1, INT_MAX, - NULL, NULL, NULL - }, - /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index f3c5b2010d..ab20aa04b0 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,13 +110,11 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid? 
*/ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *prev_saved; /* list prev link, if so */ - struct CachedPlanSource *next_saved; /* list next link, if so */ + struct CachedPlanSource *next_saved; /* list link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ - TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -145,9 +143,6 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; -/* GUC variables */ -extern int min_cached_plans; -extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.3 From 775a952b51ca4fc597683b80a20380edd1af1328 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 3 Jul 2018 09:05:32 +0900 Subject: [PATCH 4/4] Generic plan removal of CachedPlanSource. We cannot remove saved cached plans while pruning since they are pointed to from other variables. But we can still remove the generic plan of each saved plan. The behavior is controlled by two additional GUC variables, min_cached_plans and cache_prune_min_age. The former tells how many generic plans to keep unpruned. The latter tells how long we should keep generic plans before pruning.
--- src/backend/utils/cache/plancache.c | 163 ++++++++++++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 10 +++ src/include/utils/plancache.h | 7 +- 3 files changed, 179 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 0ad3e3c736..701ead152c 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,12 @@ * guarantee to save a CachedPlanSource without error. */ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* @@ -208,6 +217,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -423,6 +434,28 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void +MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan 
!= plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -470,6 +503,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. */ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -492,7 +530,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -502,10 +544,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -537,6 +588,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + + /* decrement "saved plans" counter */ + if (plansource->is_saved) + { + Assert (num_saved_plans > 0); + num_saved_plans--; + } } } @@ -1148,6 +1206,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo 
boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* + * set last-accessed timestamp and move this plan to the first of the list + */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the first of the list */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1156,6 +1225,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved && + min_cached_plans >= 0 && num_saved_plans > min_cached_plans) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1168,6 +1242,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + /* count this new saved plan */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1856,6 +1935,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PruneCachedPlan: removes the generic plan of "old" saved plans. + */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < cache_prune_min_age) + return; + } + + /* last plan is the oldest.
*/ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* we don't want to prune any more plans */ + if (num_saved_plans <= min_cached_plans) + break; + + /* + * No work if it already doesn't have gplan and move it to the + * beginning so that we don't see it at the next time + */ + if (!plansource->gplan) + continue; + + /* + * Check age for pruning. Can exit immediately when finding a + * not-older element. + */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= cache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove the generic plan of this plansource if it is not + * actually in use, and move it to the beginning of the list. Just + * update last_access and move it to the beginning if the plan is used. + */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansources altogether to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans.
*/ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index d89654cf8a..e49346707d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2128,6 +2128,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept in memory."), + gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches this value. -1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index ab20aa04b0..f3c5b2010d 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,11 +110,13 @@ typedef struct CachedPlanSource bool is_valid; /* is the query_list currently valid?
*/ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *next_saved; /* list link, if so */ + struct CachedPlanSource *prev_saved; /* list prev link, if so */ + struct CachedPlanSource *next_saved; /* list next link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +145,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void); -- 2.16.3
On 2018-Jul-02, Andrew Dunstan wrote: > Andres suggested back in March (and again privately to me) that given how > much this has changed from the original this CF item should be marked > Returned With Feedback and the current patchset submitted as a new item. > > Does anyone object to that course of action? If doing that makes the "CF count" reset back to one for the new submission, then I object to that course of action. If we really think this item does not belong in this commitfest, let's punt it to the next one. However, it seems rather strange to do so this early in the cycle. Is there really no small item that could be cherry-picked from this series to be committed standalone? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-07-02 21:50:36 -0400, Alvaro Herrera wrote: > On 2018-Jul-02, Andrew Dunstan wrote: > > > Andres suggested back in March (and again privately to me) that given how > > much this has changed from the original this CF item should be marked > > Returned With Feedback and the current patchset submitted as a new item. > > > > Does anyone object to that course of action? > > If doing that makes the "CF count" reset back to one for the new > submission, then I object to that course of action. If we really > think this item does not belong into this commitfest, lets punt it to > the next one. However, it seems rather strange to do so this early in > the cycle. Is there really no small item that could be cherry-picked > from this series to be committed standalone? Well, I think it should just have been RWFed last cycle. It got plenty of feedback. So it doesn't seem that strange to me, not to include it in the "mop-up" CF? Either way, I don't feel strongly about it, I just know that I won't have energy for the topic in this CF. Greetings, Andres Freund
Hi, >Subject: Re: Protect syscache from bloating with negative cache entries > >Hello. The previous v4 patchset was just broken. >Somehow the 0004 was merged into the 0003 and applying 0004 results in failure. I >removed 0004 part from the 0003 and rebased and repost it. I have some questions about syscache and relcache pruning, though they may already have been discussed upthread or be beside the point. Can I confirm my understanding of catcache pruning? syscache_memory_target is the maximum figure per CatCache (every CatCache has the same maximum), so the total maximum size of the catalog caches is estimated at around, or slightly more than, the number of SysCache arrays times syscache_memory_target. If that is correct, I think writing this estimation down in the documentation would help DB administrators estimate memory usage. The current description could be misread as saying that syscache_memory_target is the total size of the catalog caches. Related to the above, I wonder whether setting syscache_memory_target per CatCache would make memory usage more efficient. Though I haven't checked whether each system catalog cache's memory usage actually varies widely, the pg_class cache might need more memory than the others, and others might need less. But it would be difficult for users to check and tune each CatCache's memory usage, because right now PostgreSQL provides no handy way to check them. Another option is that users specify only the total memory target size and postgres dynamically adjusts each CatCache's memory target according to a certain metric (which still seems difficult and expensive to develop relative to the benefit). What do you think about this? + /* + * Set up pruning. + * + * We have two knobs to control pruning and a hash can share them of
+ * + */ + if (flags & HASH_PRUNABLE) + { + hctl->prunable = true; + hctl->prune_cb = info->prune_cb; + if (info->memory_target) + hctl->memory_target = info->memory_target; + else + hctl->memory_target = &cache_memory_target; + if (info->prune_min_age) + hctl->prune_min_age = info->prune_min_age; + else + hctl->prune_min_age = &cache_prune_min_age; + } + else + hctl->prunable = false; As you commented here, the GUC variables syscache_memory_target and syscache_prune_min_age are used for both syscache and relcache (HTAB), right? Do syscache and relcache have similar amounts of memory usage? If not, I'm thinking that introducing a separate GUC variable for relcache would be fine, and likewise for syscache_prune_min_age. Regards, ==================== Takeshi Ideriha Fujitsu Limited
Hello. Thank you for looking at this. At Wed, 12 Sep 2018 05:16:52 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F197012@G01JPEXMBKW04> > Hi, > > >Subject: Re: Protect syscache from bloating with negative cache entries > > > >Hello. The previous v4 patchset was just broken. > > >Somehow the 0004 was merged into the 0003 and applying 0004 results in failure. I > >removed 0004 part from the 0003 and rebased and repost it. > > I have some questions about syscache and relcache pruning > though they may be discussed at upper thread or out of point. > > Can I confirm about catcache pruning? > syscache_memory_target is the max figure per CatCache. > (Any CatCache has the same max value.) > So the total max size of catalog caches is estimated around or > slightly more than # of SysCache array times syscache_memory_target. Right. > If correct, I'm thinking writing down the above estimation to the document > would help db administrators with estimation of memory usage. > Current description might lead misunderstanding that syscache_memory_target > is the total size of catalog cache in my impression. Honestly, I'm not sure that is the right design. However, I don't think providing such a formula helps users, since they don't know exactly how many CatCaches and their brethren live in their server; it is a soft limit, and in the end only a few catalogs, or just one, can reach the limit. The current design is based on the assumption that we would have only one extremely-growable cache in one use case. > Related to the above I just thought changing sysycache_memory_target per CatCache > would make memory usage more efficient. We could easily have per-cache settings in CatCache, but how do we provide the knobs for them? I can only come up with overly complicated solutions for that.
> Though I haven't checked if there's a case that each system catalog cache memory usage varies largely, > pg_class cache might need more memory than others and others might need less. > But it would be difficult for users to check each CatCache memory usage and tune it > because right now postgresql hasn't provided a handy way to check them. I supposed that this would be used without such a means: someone who suffers syscache bloat can simply set this GUC to avoid it. Apart from that, in the current patch, syscache_memory_target is not exact at all in the first place, so as to avoid the overhead of counting the correct size. The major difference comes from the size of the cache tuple itself, but I have come to think that is too much to omit. As a *PoC*, in the attached patch (which applies to current master), the sizes of CTups are counted as the catcache size. It also provides a pg_syscache_sizes system view just to give a rough idea of how such a view looks. I'll consider this more, but do you have any opinion on it? =# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size desc; relid | indid | size -------------------------+-------------------------------------------+-------- pg_class | pg_class_oid_index | 131072 pg_class | pg_class_relname_nsp_index | 131072 pg_cast | pg_cast_source_target_index | 5504 pg_operator | pg_operator_oprname_l_r_n_index | 4096 pg_statistic | pg_statistic_relid_att_inh_index | 2048 pg_proc | pg_proc_proname_args_nsp_index | 2048 .. > Another option is that users only specify the total memory target size and postgres > dynamically change each CatCache memory target size according to a certain metric. > (, which still seems difficult and expensive to develop per benefit) > What do you think about this? Given that few caches bloat at once, its effect is not so different from the current design's. > As you commented here, guc variable syscache_memory_target and > syscache_prune_min_age are used for both syscache and relcache (HTAB), right?
Right, just not to add knobs for unclear reasons. Since ... > Do syscache and relcache have the similar amount of memory usage? They may differ, but the difference wouldn't amount to much in a cache-bloat situation. > If not, I'm thinking that introducing separate guc variable would be fine. > So as syscache_prune_min_age. I implemented it that way so that it is easily replaceable in case, but I'm not sure separating them makes a significant difference. Thanks for the opinions; I'll give this more consideration. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bee4afbe4e..6a00141fc9 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1617,6 +1617,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + a certain number of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds after which a + syscache entry is considered for removal.
-1 indicates that syscache + pruning is disabled entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 875be180fe..df4256466c 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -713,6 +713,9 @@ void SetCurrentStatementStartTimestamp(void) { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as approximated current time */ + SetCatCacheClock(stmtStartTimestamp); } /* diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 7251552419..1a1acd9bc7 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -938,6 +938,11 @@ REVOKE ALL ON pg_subscription FROM public; GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublications) ON pg_subscription TO public; +-- XXXXXXXXXXXXXXXXXXXXXX +CREATE VIEW pg_syscache_sizes AS + SELECT * + FROM pg_get_syscache_sizes(); + -- -- We have a few function definitions in here, too. diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..aafdc4f8f2 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * This variable is shared among various cache mechanisms.
+ */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be considered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -498,6 +513,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -849,6 +865,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -866,9 +883,129 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * nth element in nentries stores the number of cache entries that have + * lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age.
The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that are not accessed after the last pruning are removed + * in that many seconds, and entries that have been accessed several + * times are removed after being left alone for up to three times + * that duration. We don't try to shrink buckets since pruning + * effectively caps catcache expansion in the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1419,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1813,7 +1955,6 @@ ReleaseCatCacheList(CatCList *list) CatCacheRemoveCList(list->my_cache, list); } - /* * CatalogCacheCreateEntry * Create a new CatCTup entry, copying the given HeapTuple and other @@ -1827,11 +1968,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1850,13 +1993,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1884,8 +2028,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1906,17 +2050,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. 
Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; @@ -2118,3 +2269,9 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +int +CatCacheGetSize(CatCache *cache) +{ + return cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); +} diff --git a/src/backend/utils/cache/plancache.c b/src/backend/utils/cache/plancache.c index 7271b5880b..490cb8ec8a 100644 --- a/src/backend/utils/cache/plancache.c +++ b/src/backend/utils/cache/plancache.c @@ -63,12 +63,14 @@ #include "storage/lmgr.h" #include "tcop/pquery.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/resowner_private.h" #include "utils/rls.h" #include "utils/snapmgr.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* @@ -86,6 +88,12 @@ * guarantee to save a CachedPlanSource without error.
*/ static CachedPlanSource *first_saved_plan = NULL; +static CachedPlanSource *last_saved_plan = NULL; +static int num_saved_plans = 0; +static TimestampTz oldest_saved_plan = 0; + +/* GUC variables */ +int min_cached_plans = 1000; static void ReleaseGenericPlan(CachedPlanSource *plansource); static List *RevalidateCachedQuery(CachedPlanSource *plansource, @@ -105,6 +113,7 @@ static TupleDesc PlanCacheComputeResultDesc(List *stmt_list); static void PlanCacheRelCallback(Datum arg, Oid relid); static void PlanCacheFuncCallback(Datum arg, int cacheid, uint32 hashvalue); static void PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue); +static void PruneCachedPlan(void); /* GUC parameter */ int plan_cache_mode; @@ -210,6 +219,8 @@ CreateCachedPlan(RawStmt *raw_parse_tree, plansource->generic_cost = -1; plansource->total_custom_cost = 0; plansource->num_custom_plans = 0; + plansource->last_access = GetCatCacheClock(); + MemoryContextSwitchTo(oldcxt); @@ -425,6 +436,28 @@ CompleteCachedPlan(CachedPlanSource *plansource, plansource->is_valid = true; } +/* moves the plansource to the first in the list */ +static inline void +MovePlansourceToFirst(CachedPlanSource *plansource) +{ + if (first_saved_plan != plansource) + { + /* delink this element */ + if (plansource->next_saved) + plansource->next_saved->prev_saved = plansource->prev_saved; + if (plansource->prev_saved) + plansource->prev_saved->next_saved = plansource->next_saved; + if (last_saved_plan == plansource) + last_saved_plan = plansource->prev_saved; + + /* insert at the beginning */ + first_saved_plan->prev_saved = plansource; + plansource->next_saved = first_saved_plan; + plansource->prev_saved = NULL; + first_saved_plan = plansource; + } +} + /* * SaveCachedPlan: save a cached plan permanently * @@ -472,6 +505,11 @@ SaveCachedPlan(CachedPlanSource *plansource) * Add the entry to the global list of cached plans. 
*/ plansource->next_saved = first_saved_plan; + if (first_saved_plan) + first_saved_plan->prev_saved = plansource; + else + last_saved_plan = plansource; + plansource->prev_saved = NULL; first_saved_plan = plansource; plansource->is_saved = true; @@ -494,7 +532,11 @@ DropCachedPlan(CachedPlanSource *plansource) if (plansource->is_saved) { if (first_saved_plan == plansource) + { first_saved_plan = plansource->next_saved; + if (first_saved_plan) + first_saved_plan->prev_saved = NULL; + } else { CachedPlanSource *psrc; @@ -504,10 +546,19 @@ DropCachedPlan(CachedPlanSource *plansource) if (psrc->next_saved == plansource) { psrc->next_saved = plansource->next_saved; + if (psrc->next_saved) + psrc->next_saved->prev_saved = psrc; break; } } } + + if (last_saved_plan == plansource) + { + last_saved_plan = plansource->prev_saved; + if (last_saved_plan) + last_saved_plan->next_saved = NULL; + } plansource->is_saved = false; } @@ -539,6 +590,13 @@ ReleaseGenericPlan(CachedPlanSource *plansource) Assert(plan->magic == CACHEDPLAN_MAGIC); plansource->gplan = NULL; ReleaseCachedPlan(plan, false); + + /* decrement "saved plans" counter */ + if (plansource->is_saved) + { + Assert (num_saved_plans > 0); + num_saved_plans--; + } } } @@ -1156,6 +1214,17 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (useResOwner && !plansource->is_saved) elog(ERROR, "cannot apply ResourceOwner to non-saved cached plan"); + /* + * set last-accessed timestamp and move this plan to the first of the list + */ + if (plansource->is_saved) + { + plansource->last_access = GetCatCacheClock(); + + /* move this plan to the first of the list */ + MovePlansourceToFirst(plansource); + } + /* Make sure the querytree list is valid and we have parse-time locks */ qlist = RevalidateCachedQuery(plansource, queryEnv); @@ -1164,6 +1233,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, if (!customplan) { + /* Prune cached plans if needed */ + if (plansource->is_saved 
&& + min_cached_plans >= 0 && num_saved_plans > min_cached_plans) + PruneCachedPlan(); + if (CheckCachedPlan(plansource)) { /* We want a generic plan, and we already have a valid one */ @@ -1176,6 +1250,11 @@ GetCachedPlan(CachedPlanSource *plansource, ParamListInfo boundParams, plan = BuildCachedPlan(plansource, qlist, NULL, queryEnv); /* Just make real sure plansource->gplan is clear */ ReleaseGenericPlan(plansource); + + /* count this new saved plan */ + if (plansource->is_saved) + num_saved_plans++; + /* Link the new generic plan into the plansource */ plansource->gplan = plan; plan->refcount++; @@ -1864,6 +1943,90 @@ PlanCacheSysCallback(Datum arg, int cacheid, uint32 hashvalue) ResetPlanCache(); } +/* + * PruneCachedPlan: removes the generic plans of "old" saved plans. + */ +static void +PruneCachedPlan(void) +{ + CachedPlanSource *plansource; + TimestampTz currclock = GetCatCacheClock(); + long age; + int us; + int nremoved = 0; + + /* do nothing if not wanted */ + if (cache_prune_min_age < 0 || num_saved_plans <= min_cached_plans) + return; + + /* Fast check for oldest cache */ + if (oldest_saved_plan > 0) + { + TimestampDifference(oldest_saved_plan, currclock, &age, &us); + if (age < cache_prune_min_age) + return; + } + + /* last plan is the oldest. */ + for (plansource = last_saved_plan; plansource; plansource = plansource->prev_saved) + { + long plan_age; + int us; + + Assert(plansource->magic == CACHEDPLANSOURCE_MAGIC); + + /* we don't want to prune any more plans */ + if (num_saved_plans <= min_cached_plans) + break; + + /* + * No work if it already doesn't have gplan and move it to the + * beginning so that we don't see it at the next time + */ + if (!plansource->gplan) + continue; + + /* + * Check age for pruning. Can exit immediately when finding a + * not-older element.
+ */ + TimestampDifference(plansource->last_access, currclock, &plan_age, &us); + if (plan_age <= cache_prune_min_age) + { + /* this entry is the next oldest */ + oldest_saved_plan = plansource->last_access; + break; + } + + /* + * Here, remove the generic plan of this plansource if it is not + * actually in use, and move it to the beginning of the list. Just + * update last_access and move it to the beginning if the plan is used. + */ + if (plansource->gplan->refcount <= 1) + { + ReleaseGenericPlan(plansource); + nremoved++; + } + + plansource->last_access = currclock; + } + + /* move the "removed" plansources altogether to the beginning of the list */ + if (plansource != last_saved_plan && plansource) + { + plansource->next_saved->prev_saved = NULL; + first_saved_plan->prev_saved = last_saved_plan; + last_saved_plan->next_saved = first_saved_plan; + first_saved_plan = plansource->next_saved; + plansource->next_saved = NULL; + last_saved_plan = plansource; + } + + if (nremoved > 0) + elog(DEBUG1, "plancache removed %d/%d", nremoved, num_saved_plans); +} + /* * ResetPlanCache: invalidate all cached plans.
*/ diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 2b381782a3..9cdb75afb8 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -73,9 +73,14 @@ #include "catalog/pg_ts_template.h" #include "catalog/pg_type.h" #include "catalog/pg_user_mapping.h" +#include "funcapi.h" +#include "miscadmin.h" +#include "nodes/execnodes.h" #include "utils/rel.h" #include "utils/catcache.h" #include "utils/syscache.h" +#include "utils/tuplestore.h" +#include "utils/fmgrprotos.h" /*--------------------------------------------------------------------------- @@ -1530,6 +1535,64 @@ RelationSupportsSysCache(Oid relid) } +/* + * rough size of this syscache + */ +Datum +pg_get_syscache_sizes(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 3 + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + int cacheId; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + oldcontext = MemoryContextSwitchTo(per_query_ctx); + + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + Datum values[PG_GET_SYSCACHE_SIZE]; + bool 
nulls[PG_GET_SYSCACHE_SIZE]; + int i; + + memset(nulls, 0, sizeof(nulls)); + + i = 0; + values[i++] = cacheinfo[cacheId].reloid; + values[i++] = cacheinfo[cacheId].indoid; + values[i++] = Int64GetDatum(CatCacheGetSize(SysCache[cacheId])); + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + tuplestore_donestoring(tupstore); + + return (Datum) 0; +} + /* * OID comparator for pg_qsort */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 0625eff219..3154574f62 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -79,6 +79,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2113,6 +2114,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"min_cached_plans", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum number of cached plans kept on memory."), + gettext_noop("Timeout invalidation of plancache is not activated until the number of plancaches reaches thisvalue. -1 means timeout invalidation is always active.") + }, + &min_cached_plans, + 1000, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 7486d20a34..917d7cb5cf 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -126,6 +126,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 860571440a..c0bfcc9f70 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9800,6 +9800,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3423', + descr => 'syscache size', + proname => 'pg_get_syscache_sizes', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => '', + proallargtypes => '{oid,oid,int8}', + proargmodes => '{o,o,o}', + proargnames => '{relid,indid,size}', + prosrc => 'pg_get_syscache_sizes' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..9c326d6af6 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ 
-61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total amount of catcache tuples */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLPMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, @@ -227,5 +253,6 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +extern int CatCacheGetSize(CatCache *cache); #endif /* CATCACHE_H */ diff --git a/src/include/utils/plancache.h b/src/include/utils/plancache.h index 5fc7903a06..338b3470b7 100644 --- a/src/include/utils/plancache.h +++ b/src/include/utils/plancache.h @@ -110,11 +110,13 @@ typedef struct CachedPlanSource bool 
is_valid; /* is the query_list currently valid? */ int generation; /* increments each time we create a plan */ /* If CachedPlanSource has been saved, it is a member of a global list */ - struct CachedPlanSource *next_saved; /* list link, if so */ + struct CachedPlanSource *prev_saved; /* list prev link, if so */ + struct CachedPlanSource *next_saved; /* list next link, if so */ /* State kept to help decide whether to use custom or generic plans: */ double generic_cost; /* cost of generic plan, or -1 if not known */ double total_custom_cost; /* total cost of custom plans so far */ int num_custom_plans; /* number of plans included in total */ + TimestampTz last_access; /* timestamp of the last usage */ } CachedPlanSource; /* @@ -143,6 +145,9 @@ typedef struct CachedPlan MemoryContext context; /* context containing this CachedPlan */ } CachedPlan; +/* GUC variables */ +extern int min_cached_plans; +extern int plancache_prune_min_age; extern void InitPlanCache(void); extern void ResetPlanCache(void);
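The plancache half of the patch above threads CachedPlanSource entries onto a doubly linked saved-plan list, stamps each entry with a last_access timestamp, and moves touched entries to the head so that pruning can walk stale entries from the tail and stop at the first sufficiently-recent one. The list discipline can be sketched standalone as follows (my own illustration, not PostgreSQL code — Entry, LruList, and the function names are invented):

```c
/* Standalone sketch of the saved-plan LRU list: entries carry a
 * last_access timestamp, lookups move the entry to the list head,
 * and pruning walks from the tail removing entries that have been
 * idle longer than min_age seconds. */
#include <assert.h>
#include <stdlib.h>

typedef struct Entry
{
    struct Entry *prev, *next;
    long last_access;           /* last access time, in seconds */
    int id;
} Entry;

typedef struct
{
    Entry *head, *tail;         /* head = most recently used */
} LruList;

static void
lru_unlink(LruList *l, Entry *e)
{
    if (e->prev) e->prev->next = e->next; else l->head = e->next;
    if (e->next) e->next->prev = e->prev; else l->tail = e->prev;
    e->prev = e->next = NULL;
}

static void
lru_push_head(LruList *l, Entry *e)
{
    e->prev = NULL;
    e->next = l->head;
    if (l->head) l->head->prev = e; else l->tail = e;
    l->head = e;
}

/* Touch an entry: update its timestamp and move it to the head. */
static void
lru_touch(LruList *l, Entry *e, long now)
{
    e->last_access = now;
    lru_unlink(l, e);
    lru_push_head(l, e);
}

/* Remove entries idle for more than min_age seconds, walking from the
 * tail; stops at the first non-stale entry. Returns removal count. */
static int
lru_prune(LruList *l, long now, long min_age)
{
    int nremoved = 0;
    Entry *e = l->tail;

    while (e && now - e->last_access > min_age)
    {
        Entry *prev = e->prev;

        lru_unlink(l, e);
        free(e);
        nremoved++;
        e = prev;
    }
    return nremoved;
}
```

Because recently-used entries always sit near the head, the prune pass can bail out at the first entry newer than the cutoff, which is the same early-exit the patch takes via oldest_saved_plan.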
Hi, thank you for the explanation. >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] >> >> Can I confirm about catcache pruning? >> syscache_memory_target is the max figure per CatCache. >> (Any CatCache has the same max value.) So the total max size of >> catalog caches is estimated around or slightly more than # of SysCache >> array times syscache_memory_target. > >Right. > >> If correct, I'm thinking writing down the above estimation in the >> document would help db administrators with estimation of memory usage. >> The current description might lead to the misunderstanding that >> syscache_memory_target is the total size of the catalog caches, in my impression. > >Honestly I'm not sure that is the right design. However, I don't think providing such a >formula to users helps, since they don't know exactly how many CatCaches and >siblings live in their server, it is a soft limit, and in the end only a few, or just one, catalog >caches can reach the limit. Yeah, I agree that such a formula is not suited for the document. But if users don't know how many catcaches and siblings are used in postgres, then how about making syscache_memory_target a total soft limit of the catcaches, rather than a size limit on each individual catcache? Internally, syscache_memory_target could be divided by the number of syscaches. The total amount would be easier to understand for users who don't know the detailed contents of the catalog caches. Or, if users can tell how many and what kind of catcaches exist, for instance by using the system view you provided in the previous email, the current design looks good to me. >The current design is based on the assumption that we would have only one >extremely-growable cache in one use case. > >> Related to the above, I just thought changing syscache_memory_target >> per CatCache would make memory usage more efficient. > >We could easily have per-cache settings in CatCache, but how do we provide the knobs >for them?
There are too many conceivable solutions for that. Agreed. >> Though I haven't checked if there's a case where each system catalog >> cache's memory usage varies largely, the pg_class cache might need more memory than >others and others might need less. >> But it would be difficult for users to check each CatCache's memory >> usage and tune it because right now postgresql doesn't provide a handy way to >check them. > >I supposed that this would be used without such a means. Someone who suffers syscache bloat >can just set this GUC to avoid the bloat. End. Yeah, I took the purpose wrong. >Apart from that, in the current patch, syscache_memory_target is not exact at all in >the first place, to avoid the overhead of counting the correct size. The major difference comes >from the size of the cache tuple itself. But I came to think it is too much to omit. > >As a *PoC*, in the attached patch (which applies to current master), the size of CTups is >counted as the catcache size. > >It also provides a pg_catcache_size system view just to give a rough idea of how such a >view looks. I'll consider more on that, but do you have any opinion on this? > >=# select relid::regclass, indid::regclass, size from pg_syscache_sizes order by size >desc; > relid | indid | size >-------------------------+-------------------------------------------+---------- > pg_class | pg_class_oid_index | 131072 > pg_class | pg_class_relname_nsp_index | 131072 > pg_cast | pg_cast_source_target_index | 5504 > pg_operator | pg_operator_oprname_l_r_n_index | 4096 > pg_statistic | pg_statistic_relid_att_inh_index | 2048 > pg_proc | pg_proc_proname_args_nsp_index | 2048 >.. Great! I like this view.
One extreme idea would be adding all the members printed by CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this moment. All of the members seem too much for customers who try to change the cache limit size, but some of them may be useful; for example, cc_hits would indicate that the current cache limit size is too small. >> Another option is that users only specify the total memory target size >> and postgres dynamically changes each CatCache's memory target size according to a >certain metric. >> (, which still seems difficult and expensive to develop per benefit) >> What do you think about this? > >Given that few caches bloat at once, its effect is not so different from the current >design. Yes, agreed. >> As you commented here, the guc variables syscache_memory_target and >> syscache_prune_min_age are used for both syscache and relcache (HTAB), right? > >Right, just not to add knobs for unclear reasons. Since ... > >> Do syscache and relcache have a similar amount of memory usage? > >They may be different, but that would not make much difference in the case of cache bloat. >> If not, I'm thinking that introducing separate guc variables would be fine. >> So as syscache_prune_min_age. > >I implemented it that way so that it is easily replaceable in case, but I'm not sure separating >them makes a significant difference. Maybe I was overthinking, mixing in my development. Regards, Takeshi Ideriha
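For reference, the eviction policy under discussion (which the attached patch implements for catcache entries) is a buffer-eviction-style access counter: each hit bumps a per-entry counter capped at 2, and a pruning sweep over entries older than cache_prune_min_age decrements the counter, evicting an entry only once the counter has reached zero. A minimal standalone sketch of that rule (types and names are my own, not from the patch):

```c
/* Sketch of the access-counter eviction rule: naccess is bumped on
 * every hit (capped at 2); a pruning pass over an entry older than
 * prune_min_age decrements naccess while it is positive and evicts
 * the entry only when it has reached zero. A repeatedly-accessed
 * entry therefore survives up to three consecutive sweeps. */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    int naccess;        /* clock-sweep-style counter, capped at 2 */
    long lastaccess;    /* last access time, in seconds */
    bool valid;         /* still present in the cache? */
} Entry;

static void
entry_touch(Entry *e, long now)
{
    if (e->naccess < 2)
        e->naccess++;
    e->lastaccess = now;
}

/* One pruning pass over a single entry; returns true if evicted. */
static bool
prune_entry(Entry *e, long now, long prune_min_age)
{
    if (!e->valid || now - e->lastaccess <= prune_min_age)
        return false;           /* recently used; leave it alone */
    if (e->naccess > 0)
    {
        e->naccess--;           /* old, but give it another chance */
        return false;
    }
    e->valid = false;           /* old and out of chances: evict */
    return true;
}
```

The cap at 2 is what bounds the worst case: an entry can outlive at most two "second chances" before an idle period of cache_prune_min_age finally removes it.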
Hello. Thank you for the comment. At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04> > >As a *PoC*, in the attached patch (which applies to current master), size of CTups are > >counted as the catcache size. > > > >It also provides pg_catcache_size system view just to give a rough idea of how such > >view looks. I'll consider more on that but do you have any opinion on this? > > ... > Great! I like this view. > One of the extreme idea would be adding all the members printed by CatCachePrintStats(), > which is only enabled with -DCATCACHE_STATS at this moment. > All of the members seems too much for customers who tries to change the cache limit size > But it may be some of the members are useful because for example cc_hits would indicate that current > cache limit size is too small. The attached patches introduce the four features below. (The features on relcache and plancache are omitted.) 1. syscache stats collector (in 0002) Records syscache status, consisting of the same columns as above plus "ageclass" information. We could somehow trigger a stats report with a signal, but we don't want to take/send/write the statistics in a signal handler. Instead, it is turned on by setting track_syscache_usage_interval to a positive number in milliseconds. 2. pg_stat_syscache view. (in 0002) This view shows catcache statistics. Statistics are taken only on the backends where syscache tracking is active. > pid | application_name | relname | cache_name | size | ageclass | nentries > ------+------------------+----------------+-----------------------------------+----------+-------------------------+--------------------------- > 9984 | psql | pg_statistic | pg_statistic_relid_att_inh_index | 12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0} The age class is the basis of the catcache truncation mechanism and shows the distribution based on elapsed time since last access.
As I couldn't come up with a more appropriate representation, it is shown as two arrays. Ageclass stores the maximum age for each class in seconds. Nentries holds the number of entries corresponding to the same element in ageclass. In the above example:

age class : # of entries in the cache
up to 30s : 17660
up to 60s : 17310
up to 600s : 55870
up to 1200s : 0
up to 1800s : 0
longer : 0

The ageclass boundaries are the {0, 0.05, 0.1, 1, 2, 3} multiples of cache_prune_min_age on the backend. 3. non-transactional GUC setting (in 0003) It allows a GUC variable set with the action GUC_ACTION_NONXACT (the name requires consideration) to survive beyond a rollback. It is required for remote guc setting to work sanely. Without this feature, a value set remotely within a transaction would disappear on rollback. The only local interface for the NONXACT action is set_config(name, value, is_local=false, is_nonxact = true). pg_set_backend_guc() below works on this feature. 4. pg_set_backend_guc() function. Of course, syscache statistics recording consumes a significant amount of time, so it cannot usually be left turned on. On the other hand, since this feature is turned on by a GUC, we would need to grab the active client connection to turn the feature on/off (but we cannot). Instead, I provided a means to change GUC variables in another backend. pg_set_backend_guc(pid, name, value) sets the GUC variable "name" on the backend "pid" to "value". With the above tools, we can inspect the catcache statistics of a seemingly bloated process. A. Find the bloated process's pid using ps or something. B. Turn on syscache stats on the process. =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000'); C. Examine the statistics.
=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc limit 3; pid | relname | cache_name | size ------+--------------+----------------------------------+---------- 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112 9984 | pg_cast | pg_cast_source_target_index | 4096 9984 | pg_operator | pg_operator_oprname_l_r_n_index | 4096 =# select * from pg_stat_syscache where cache_name = 'pg_statistic_relid_att_inh_index'::regclass; -[ RECORD 1 ]--------------------------------- pid | 9984 relname | pg_statistic cache_name | pg_statistic_relid_att_inh_index size | 11026176 ntuples | 77950 searches | 77950 hits | 0 neg_hits | 0 ageclass | {30,60,600,1200,1800,0} nentries | {17630,16950,43370,0,0,0} last_update | 2018-10-17 15:58:19.738164+09 > >> Another option is that users only specify the total memory target size > >> and postgres dynamically change each CatCache memory target size according to a > >certain metric. > >> (, which still seems difficult and expensive to develop per benefit) > >> What do you think about this? > > > >Given that few caches bloat at once, it's effect is not so different from the current > >design. > Yes agreed. > > >> As you commented here, guc variable syscache_memory_target and > >> syscache_prune_min_age are used for both syscache and relcache (HTAB), right? > > > >Right, just not to add knobs for unclear reasons. Since ... > > > >> Do syscache and relcache have the similar amount of memory usage? > > > >They may be different but would make not so much in the case of cache bloat. > >> If not, I'm thinking that introducing separate guc variable would be fine. > >> So as syscache_prune_min_age. > > > >I implemented that so that it is easily replaceable in case, but I'm not sure separating > >them makes significant difference.. > Maybe I was overthinking mixing my development. regards. 
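The ageclass/nentries pair reported above can be read as a simple bucketing by idle time: the thresholds are fixed multiples of cache_prune_min_age, with a final open-ended class for anything older. A sketch of the bucketing (my own simplification; the patch's actual CATCACHE_STATS loop folds the overflow into the last finite class):

```c
/* Sketch of ageclass bucketing: thresholds are the {0.05, 0.1, 1, 2, 3}
 * multiples of cache_prune_min_age; with the default of 600s these are
 * 30, 60, 600, 1200, and 1800 seconds, matching the
 * {30,60,600,1200,1800,0} arrays shown in the pg_stat_syscache output.
 * The terminating 0.0 marks the open-ended last class. */
#include <assert.h>

static const double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};

/* Return the index of the age class that an entry idle for entry_age
 * seconds falls into, given the prune_min_age setting. */
static int
ageclass_index(long entry_age, long prune_min_age)
{
    int j;

    for (j = 0; ageclass[j] != 0.0; j++)
    {
        if (entry_age <= prune_min_age * ageclass[j])
            return j;
    }
    return j;                   /* older than every threshold */
}
```

Summing nentries over the classes above a given index is then a quick way to see how much of the cache has been idle long enough to be a pruning candidate.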
-- Kyotaro Horiguchi NTT Open Source Software Center From 4125f38c439d305797907bb95e5a35c7f869244e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 166 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 254 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 7554cba3f9..c3133d742b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1618,6 +1618,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + a certain amount of syscache entries with intermittent usage, try + increasing this setting.
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 8c1621d949..083b6dc7aa 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,7 +733,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5ddbf6eab1..9be463311d 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum hash size at which to consider entry eviction.
+ */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be considered to + * be evicted, in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -498,6 +513,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -849,6 +865,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -866,9 +883,129 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * cache_prune_min_age.
The index of nremoved_entry is the value of the + * clock-sweep counter, which ranges from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that have not been accessed since the last pruning are + * removed after that many seconds, while entries that have been + * accessed several times are removed only after being left alone + * for up to three times that duration. We don't try to shrink + * buckets since pruning effectively caps catcache expansion in + * the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1282,6 +1419,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1827,11 +1969,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; Assert(!negative); @@ -1850,13 +1994,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1884,8 +2029,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry.
If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 2317e8be6b..1a49d576fa 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -80,6 +80,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2113,6 +2114,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 4e61bc6521..c59dd898ac 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -126,6 +126,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..ace4178619 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total amount of catcache tuples */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLPMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From c565131cf9db0d0b6e475a101ec247bbfc2df8ab Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/3] Syscache usage tracking feature. Collects syscache usage statictics and show it using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 18 +++ src/backend/postmaster/pgstat.c | 206 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 136 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 ++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 7 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + 17 files changed, 562 insertions(+), 44 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index c3133d742b..976a505205 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6106,6 +6106,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' 
WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which system cache usage + statistics are collected. The default is 0, which disables collection. + Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index a03b005f73..6cd19c8ecb 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -903,6 +903,23 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.nentries AS nentries, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1176,6 +1193,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO
pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 8a5b2b3b42..572d181b75 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #include "utils/tqual.h" @@ -125,6 +126,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -631,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -4290,6 +4315,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6380,3 +6408,163 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. 
+ */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval. + */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + return 0; + + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; report the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now write the file */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold off + * interrupts to avoid recursive entry.
+ */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell caller to wait for the next interval. + */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out stats for every catcache */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} + +/* + * GUC assignment callback for track_syscache_usage_interval. + * + * Create a statistics file immediately when syscache statistics collection + * is turned on, and remove it as soon as it is turned off. + */ +void +pgstat_track_syscache_assign_hook(int newval, void *extra) +{ + if (newval > 0) + { + /* + * Immediately create a stats file. It's safe since we're not in the + * midst of accessing the syscache.
+ */ + pgstat_write_syscache_stats(true); + } + else + { + /* Turned off, immediately remove the stats file */ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* ignore the result */ + } +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index e4c6e3d406..c68b857c0e 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3121,6 +3121,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3697,6 +3703,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. */ if (!IsUnderPostmaster) @@ -4137,9 +4144,19 @@ } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4182,6 +4199,12 @@ disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index e95e347184..27df8cf825 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1882,3 +1885,136 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 10 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + 
MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE]; + Datum datums[SYSCACHE_STATS_NAGECLASSES]; + ArrayType *arr; + int i, j; + + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + memset(nulls, 0, sizeof(nulls)); + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + datums[j] = Int32GetDatum((int32) stats.ageclasses[j]); + + arr = construct_array(datums, SYSCACHE_STATS_NAGECLASSES, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + for (j = 0 ; j 
< SYSCACHE_STATS_NAGECLASSES ; j++) + datums[j] = Int32GetDatum((int32) stats.nclass_entries[j]); + arr = construct_array(datums, SYSCACHE_STATS_NAGECLASSES, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 9be463311d..31e19541a6 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. */ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -627,9 +631,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -705,9 +707,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -914,10 +914,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. 
*/ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -931,7 +932,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is quite a time- + * consuming task during a catcache lookup, but acceptable since we are + * about to expand the hash table anyway. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -944,21 +949,21 @@ /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -991,14 +996,17 @@ CatCacheCleanupOldEntries(CatCache *cp) } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age class must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1375,9 +1383,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1437,9 +1443,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1448,9 +1452,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1578,9 +1580,7 @@ SearchCatCacheMiss(CatCache *cache, 
CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1691,9 +1691,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1750,9 +1748,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2270,3 +2266,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats and fills in most of the + * result. The classification here is based on the same criteria as + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed within a session, so fill this in + * every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding ageclass multiple of + * cache_prune_min_age.
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 2b381782a3..9800bfda34 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1529,6 +1532,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory.
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 5971310aab..234ae3e157 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 4f1d2a0d28..000f402a03 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1a49d576fa..c4a1616136 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3077,6 +3077,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage statistics collections, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c59dd898ac..9b3ccc5e5b 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -514,6 +514,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index cff58ed2d8..86c84c7cf4 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9603,6 +9603,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3423', + descr => 'syscache statistics', + proname =>
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,nentries,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 69f356f8cd..c056d9a39f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -81,6 +81,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index d59c24ae23..b64bc499e4 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1133,6 +1133,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1217,7 +1218,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void 
pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1352,5 +1354,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); +extern void pgstat_track_syscache_assign_hook(int newval, void *extra); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ace4178619..721948b4cc 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 4f333586ee..0cd7cc4394 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum 
SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples falling into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index dcc7307c16..e2a9c33f14 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3 From 2a7a6744c61a61a8dac2fb54f948b96d58141778 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 21:31:22 +0900 Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config. This adds two features at once (to be split later). One is a non-transactional GUC setting feature.
This allows a GUC variable set by the action GUC_ACTION_NONXACT (the name requires consideration) to survive beyond rollback. It is required for remote GUC setting to work sanely; without this feature, a remotely-set value within a transaction would disappear on rollback. The only local interface for the NONXACT action is set_config(name, value, is_local=false, is_nonxact = true). The second is the remote GUC setting feature. It uses ProcSignal to notify the target server. --- doc/src/sgml/config.sgml | 4 + doc/src/sgml/func.sgml | 30 ++ src/backend/catalog/system_views.sql | 7 +- src/backend/postmaster/pgstat.c | 3 + src/backend/storage/ipc/ipci.c | 2 + src/backend/storage/ipc/procsignal.c | 4 + src/backend/tcop/postgres.c | 10 + src/backend/utils/misc/README | 26 +- src/backend/utils/misc/guc.c | 619 +++++++++++++++++++++++++++++++++-- src/include/catalog/pg_proc.dat | 10 +- src/include/pgstat.h | 3 +- src/include/storage/procsignal.h | 3 + src/include/utils/guc.h | 13 +- src/include/utils/guc_tables.h | 5 +- src/test/regress/expected/guc.out | 223 +++++++++++++ src/test/regress/expected/rules.out | 26 +- src/test/regress/sql/guc.sql | 88 +++++ 17 files changed, 1027 insertions(+), 49 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 976a505205..34f7a08bae 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter </listitem> </itemizedlist> + <para> + Values in other sessions can also be set using the SQL + function <function>pg_set_backend_config</function>.
+ </para> </sect2> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5193df3366..b97f4e5daa 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18657,6 +18657,20 @@ SELECT collation for ('foo' COLLATE "de_DE"); <entry><type>text</type></entry> <entry>set parameter and return new value</entry> </row> + <row> + <entry> + <indexterm> + <primary>pg_set_backend_config</primary> + </indexterm> + <literal><function>pg_set_backend_config( + <parameter>process_id</parameter>, + <parameter>setting_name</parameter>, + <parameter>new_value</parameter>) + </function></literal> + </entry> + <entry><type>bool</type></entry> + <entry>set parameter in another session</entry> + </row> </tbody> </tgroup> </table> @@ -18711,6 +18725,22 @@ SELECT set_config('log_statement_stats', 'off', false); ------------ off (1 row) +</programlisting> + </para> + + <para> + <function>pg_set_backend_config</function> sets the parameter + <parameter>setting_name</parameter> to + <parameter>new_value</parameter> in the session with PID + <parameter>process_id</parameter>. The setting is always session-local, and + the function returns true on success.
An example: +<programlisting> +SELECT pg_set_backend_config(2134, 'work_mem', '16MB'); + +pg_set_backend_config +------------ + t +(1 row) </programlisting> </para> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 6cd19c8ecb..6403a461e7 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS CREATE RULE pg_settings_u AS ON UPDATE TO pg_settings WHERE new.name = old.name DO - SELECT set_config(old.name, new.setting, 'f'); + SELECT set_config(old.name, new.setting, 'f', 'f'); CREATE RULE pg_settings_n AS ON UPDATE TO pg_settings @@ -1044,6 +1044,11 @@ CREATE OR REPLACE FUNCTION pg_stop_backup ( RETURNS SETOF record STRICT VOLATILE LANGUAGE internal as 'pg_stop_backup_v2' PARALLEL RESTRICTED; +CREATE OR REPLACE FUNCTION set_config ( + setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false) + RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name' + PARALLEL UNSAFE; + -- legacy definition for compatibility with 9.3 CREATE OR REPLACE FUNCTION json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false) diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 572d181b75..80c60eefca 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3708,6 +3708,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_REMOTE_GUC: + event_name = "RemoteGUC"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0c86a581c0..03d526d12d 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); 
size = add_size(size, BackendRandomShmemSize()); + size = add_size(size, GucShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) SyncScanShmemInit(); AsyncShmemInit(); BackendRandomShmemInit(); + GucShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index b0dd7d1b37..b897c36bae 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -27,6 +27,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "utils/guc.h" /* @@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); + if (CheckProcSignal(PROCSIG_REMOTE_GUC)) + HandleRemoteGucSetInterrupt(); + SetLatch(MyLatch); latch_sigusr1_handler(); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index c68b857c0e..feee7bdbb1 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3129,6 +3129,10 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + /* We don't want to change GUC variables while running a query */ + if (RemoteGucChangePending && DoingCommandRead) + HandleGucRemoteChanges(); } @@ -4165,6 +4169,12 @@ PostgresMain(int argc, char *argv[], send_ready_for_query = false; } + /* + * (2.5) Process some pending work. + */ + if (RemoteGucChangePending) + HandleGucRemoteChanges(); + /* * (2) Allow asynchronous signals to be executed immediately if they * come in while we are waiting for client input.
(This must be diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README index 6e294386f7..42ae6c1a8f 100644 --- a/src/backend/utils/misc/README +++ b/src/backend/utils/misc/README @@ -169,10 +169,14 @@ Entry to a function with a SET option: Plain SET command: If no stack entry of current level: - Push new stack entry w/prior value and state SET + Push new stack entry w/prior value and state SET or + push new stack entry w/o value and state NONXACT. else if stack entry's state is SAVE, SET, or LOCAL: change stack state to SET, don't change saved value (here we are forgetting effects of prior set action) + else if stack entry's state is NONXACT: + change stack state to NONXACT_SET, set the current value to + prior. else (entry must have state SET+LOCAL): discard its masked value, change state to SET (here we are forgetting effects of prior SET and SET LOCAL) @@ -185,13 +189,20 @@ SET LOCAL command: else if stack entry's state is SAVE or LOCAL or SET+LOCAL: no change to stack entry (in SAVE case, SET LOCAL will be forgotten at func exit) + else if stack entry's state is NONXACT: + set current value to both prior and masked slots. set state + NONXACT+LOCAL. else (entry must have state SET): put current active into its masked slot, set state SET+LOCAL Now set new value. +Setting by NONXACT action (no command exists): + Always blow away existing stack then create a new NONXACT entry. + Transaction or subtransaction abort: - Pop stack entries, restoring prior value, until top < subxact depth + Pop stack entries, restoring prior value unless the stack entry's + state is NONXACT, until top < subxact depth Transaction or subtransaction commit (incl. successful function exit): @@ -199,9 +210,9 @@ Transaction or subtransaction commit (incl. 
successful function exit): if entry's state is SAVE: pop, restoring prior value - else if level is 1 and entry's state is SET+LOCAL: + else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL: pop, restoring *masked* value - else if level is 1 and entry's state is SET: + else if level is 1 and entry's state is SET or NONXACT+SET: pop, discarding old value else if level is 1 and entry's state is LOCAL: pop, restoring prior value @@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit): else merge entries of level N-1 and N as specified below -The merged entry will have level N-1 and prior = older prior, so easiest -to keep older entry and free newer. There are 12 possibilities since -we already handled level N state = SAVE: +The merged entry will have level N-1 and prior = older prior, so +easiest to keep older entry and free newer. Disregarding NONXACT, +there are 12 possibilities since we already handled level N state = SAVE: N-1 N @@ -232,6 +243,7 @@ SET+LOCAL SET discard top prior and second masked, state SET SET+LOCAL LOCAL discard top prior, no change to stack entry SET+LOCAL SET+LOCAL discard top prior, copy masked, state S+L +(TODO: states involving NONXACT) RESET is executed like a SET, but using the reset_val as the desired new value.
(We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c4a1616136..2eed732d2a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -202,6 +202,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel); +/* Enum and struct to command GUC setting to another backend */ +typedef enum +{ + REMGUC_VACANT, + REMGUC_REQUEST, + REMGUC_INPROCESS, + REMGUC_DONE, + REMGUC_CANCELING, + REMGUC_CANCELED, +} remote_guc_status; + +#define GUC_REMOTE_MAX_VALUE_LEN 1024 /* an arbitrary value */ +#define GUC_REMOTE_CANCEL_TIMEOUT 5000 /* in milliseconds */ + +typedef struct +{ + remote_guc_status state; + char name[NAMEDATALEN]; + char value[GUC_REMOTE_MAX_VALUE_LEN]; + int sourcepid; + int targetpid; + Oid userid; + bool success; + volatile Latch *sender_latch; + LWLock lock; +} GucRemoteSetting; + +static GucRemoteSetting *remote_setting; + +volatile bool RemoteGucChangePending = false; + /* * Options for enum values defined in this module. * @@ -3084,7 +3115,7 @@ static struct config_int ConfigureNamesInt[] = }, &pgstat_track_syscache_usage_interval, 0, 0, INT_MAX / 2, - NULL, NULL, NULL + NULL, &pgstat_track_syscache_assign_hook, NULL }, { @@ -4491,7 +4522,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val) set_extra_field(gconf, &(val->extra), NULL); } - /* * Fetch the sorted array pointer (exported for help_config.c's use ONLY) */ @@ -5283,6 +5313,22 @@ push_old_value(struct config_generic *gconf, GucAction action) /* Do we already have a stack entry of the current nest level? 
*/ stack = gconf->stack; + + /* A NONXACT action makes the existing stack useless */ + if (action == GUC_ACTION_NONXACT) + { + while (stack) + { + GucStack *prev = stack->prev; + + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + pfree(stack); + stack = prev; + } + stack = gconf->stack = NULL; + } + if (stack && stack->nest_level >= GUCNestLevel) { /* Yes, so adjust its state if necessary */ @@ -5290,28 +5336,63 @@ push_old_value(struct config_generic *gconf, GucAction action) switch (action) { case GUC_ACTION_SET: - /* SET overrides any prior action at same nest level */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_NONXACT) { - /* must discard old masked value */ - discard_stack_value(gconf, &stack->masked); + /* NONXACT rolls back to the current value */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->state = GUC_NONXACT_SET; } - stack->state = GUC_SET; + else + { + /* SET overrides other prior actions at same nest level */ + if (stack->state == GUC_SET_LOCAL) + { + /* must discard old masked value */ + discard_stack_value(gconf, &stack->masked); + } + stack->state = GUC_SET; + } + break; + case GUC_ACTION_LOCAL: if (stack->state == GUC_SET) { - /* SET followed by SET LOCAL, remember SET's value */ + /* SET followed by SET LOCAL, remember its value */ stack->masked_scontext = gconf->scontext; set_stack_value(gconf, &stack->masked); stack->state = GUC_SET_LOCAL; } + else if (stack->state == GUC_NONXACT) + { + /* + * NONXACT followed by SET LOCAL, both prior and masked + * are set to the current value + */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->masked_scontext = stack->scontext; + stack->masked = stack->prior; + stack->state = GUC_NONXACT_LOCAL; + } + else if (stack->state == GUC_NONXACT_SET) + { + /* NONXACT_SET followed by SET LOCAL, set masked */ + stack->masked_scontext = gconf->scontext; + set_stack_value(gconf,
&stack->masked); + stack->state = GUC_NONXACT_LOCAL; + } /* in all other cases, no change to stack entry */ break; case GUC_ACTION_SAVE: /* Could only have a prior SAVE of same variable */ Assert(stack->state == GUC_SAVE); break; + + case GUC_ACTION_NONXACT: + Assert(false); + break; } Assert(guc_dirty); /* must be set already */ return; @@ -5327,6 +5408,7 @@ push_old_value(struct config_generic *gconf, GucAction action) stack->prev = gconf->stack; stack->nest_level = GUCNestLevel; + switch (action) { case GUC_ACTION_SET: @@ -5338,10 +5420,15 @@ push_old_value(struct config_generic *gconf, GucAction action) case GUC_ACTION_SAVE: stack->state = GUC_SAVE; break; + case GUC_ACTION_NONXACT: + stack->state = GUC_NONXACT; + break; } stack->source = gconf->source; stack->scontext = gconf->scontext; - set_stack_value(gconf, &stack->prior); + + if (action != GUC_ACTION_NONXACT) + set_stack_value(gconf, &stack->prior); gconf->stack = stack; @@ -5436,22 +5523,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel) * stack entries to avoid leaking memory. If we do set one of * those flags, unused fields will be cleaned up after restoring. 
*/ - if (!isCommit) /* if abort, always restore prior value */ - restorePrior = true; + if (!isCommit) + { + /* GUC_NONXACT doesn't roll back */ + if (stack->state != GUC_NONXACT) + restorePrior = true; + } else if (stack->state == GUC_SAVE) restorePrior = true; else if (stack->nest_level == 1) { /* transaction commit */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_SET_LOCAL || + stack->state == GUC_NONXACT_LOCAL) restoreMasked = true; - else if (stack->state == GUC_SET) + else if (stack->state == GUC_SET || + stack->state == GUC_NONXACT_SET) { /* we keep the current active value */ discard_stack_value(gconf, &stack->prior); } - else /* must be GUC_LOCAL */ + else if (stack->state != GUC_NONXACT) + { + /* must be GUC_LOCAL */ restorePrior = true; + } } else if (prev == NULL || prev->nest_level < stack->nest_level - 1) @@ -5473,11 +5569,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET: - /* next level always becomes SET */ - discard_stack_value(gconf, &stack->prior); - if (prev->state == GUC_SET_LOCAL) + if (prev->state == GUC_SET || + prev->state == GUC_NONXACT_SET) + { + discard_stack_value(gconf, &stack->prior); + } + else if (prev->state == GUC_NONXACT) + { + prev->scontext = stack->scontext; + prev->prior = stack->prior; + prev->state = GUC_NONXACT_SET; + } + else if (prev->state == GUC_SET_LOCAL || + prev->state == GUC_NONXACT_LOCAL) + { + discard_stack_value(gconf, &stack->prior); discard_stack_value(gconf, &prev->masked); - prev->state = GUC_SET; + if (prev->state == GUC_SET_LOCAL) + prev->state = GUC_SET; + else + prev->state = GUC_NONXACT_SET; + } break; case GUC_LOCAL: @@ -5488,6 +5600,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel) prev->masked = stack->prior; prev->state = GUC_SET_LOCAL; } + else if (prev->state == GUC_NONXACT) + { + prev->prior = stack->masked; + prev->scontext = stack->masked_scontext; + prev->masked = stack->masked; + prev->masked_scontext = stack->masked_scontext; + discard_stack_value(gconf,
&stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } else { /* else just forget this stack level */ @@ -5496,15 +5618,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET_LOCAL: - /* prior state at this level no longer wanted */ - discard_stack_value(gconf, &stack->prior); - /* copy down the masked state */ - prev->masked_scontext = stack->masked_scontext; - if (prev->state == GUC_SET_LOCAL) - discard_stack_value(gconf, &prev->masked); - prev->masked = stack->masked; - prev->state = GUC_SET_LOCAL; + if (prev->state == GUC_NONXACT) + { + prev->prior = stack->prior; + prev->masked = stack->prior; + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } + else if (prev->state != GUC_NONXACT_SET) + { + /* prior state at this level no longer wanted */ + discard_stack_value(gconf, &stack->prior); + /* copy down the masked state */ + prev->masked_scontext = stack->masked_scontext; + if (prev->state == GUC_SET_LOCAL) + discard_stack_value(gconf, &prev->masked); + prev->masked = stack->masked; + prev->state = GUC_SET_LOCAL; + } break; + case GUC_NONXACT: + case GUC_NONXACT_SET: + case GUC_NONXACT_LOCAL: + Assert(false); + break; + } } @@ -7785,7 +7924,8 @@ set_config_by_name(PG_FUNCTION_ARGS) char *name; char *value; char *new_value; - bool is_local; + int set_action = GUC_ACTION_SET; + if (PG_ARGISNULL(0)) ereport(ERROR, @@ -7805,18 +7945,27 @@ set_config_by_name(PG_FUNCTION_ARGS) * Get the desired state of is_local. Default to false if provided value * is NULL */ - if (PG_ARGISNULL(2)) - is_local = false; - else - is_local = PG_GETARG_BOOL(2); + if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2)) + set_action = GUC_ACTION_LOCAL; + + /* + * Get the desired state of is_nonxact. 
Default to false if provided value + * is NULL + */ + if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3)) + { + if (set_action == GUC_ACTION_LOCAL) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only one of is_local and is_nonxact can be true"))); + set_action = GUC_ACTION_NONXACT; + } /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */ (void) set_config_option(name, value, (superuser() ? PGC_SUSET : PGC_USERSET), - PGC_S_SESSION, - is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET, - true, 0, false); + PGC_S_SESSION, set_action, true, 0, false); /* get the new current value */ new_value = GetConfigOptionByName(name, NULL, false); @@ -7825,7 +7974,6 @@ set_config_by_name(PG_FUNCTION_ARGS) PG_RETURN_TEXT_P(cstring_to_text(new_value)); } - /* * Common code for DefineCustomXXXVariable subroutines: allocate the * new variable's config struct and fill in generic fields. @@ -8024,6 +8172,13 @@ reapply_stacked_values(struct config_generic *variable, WARNING, false); break; + case GUC_NONXACT: + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_NONXACT, true, + WARNING, false); + break; + case GUC_LOCAL: (void) set_config_option(name, curvalue, curscontext, cursource, @@ -8043,6 +8198,33 @@ reapply_stacked_values(struct config_generic *variable, GUC_ACTION_LOCAL, true, WARNING, false); break; + + case GUC_NONXACT_SET: + /* first, apply the masked value as NONXACT */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current value as SET */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_SET, true, + WARNING, false); + break; + + case GUC_NONXACT_LOCAL: + /* first, apply the masked value as NONXACT */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current
value as LOCAL */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_LOCAL, true, + WARNING, false); + break; + } /* If we successfully made a stack entry, adjust its nest level */ @@ -10021,6 +10203,373 @@ GUCArrayReset(ArrayType *array) return newarray; } +Size +GucShmemSize(void) +{ + Size size; + + size = sizeof(GucRemoteSetting); + + return size; +} + +void +GucShmemInit(void) +{ + Size size; + bool found; + + size = sizeof(GucRemoteSetting); + remote_setting = (GucRemoteSetting *) + ShmemInitStruct("GUC remote setting", size, &found); + + if (!found) + { + MemSet(remote_setting, 0, size); + LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId()); + } + + LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote"); +} + +/* + * set_backend_config: SQL-callable function to set a GUC variable in a + * remote session. + */ +Datum +set_backend_config(PG_FUNCTION_ARGS) +{ + int pid = PG_GETARG_INT32(0); + char *name = text_to_cstring(PG_GETARG_TEXT_P(1)); + char *value = text_to_cstring(PG_GETARG_TEXT_P(2)); + TimestampTz cancel_start; + PgBackendStatus *beentry; + int beid; + int rc; + + if (strlen(name) >= NAMEDATALEN) + ereport(ERROR, + (errcode(ERRCODE_NAME_TOO_LONG), + errmsg("name of GUC variable is too long"))); + if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("value is too long"), + errdetail("Maximum acceptable length of value is %d", + GUC_REMOTE_MAX_VALUE_LEN - 1))); + + /* find beentry for given pid */ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * This is also checked by SendProcSignal, but do it here to emit an + * appropriate message.
+ */ + if (!beentry) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("process PID %d not found", pid))); + + /* allow only client backends */ + if (beentry->st_backendType != B_BACKEND) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("not a client backend"))); + + /* + * Wait if someone is sending a request. We need to wait with a timeout + * since the current user of the struct doesn't wake us up. + */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_VACANT) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 200, PG_WAIT_ACTIVITY); + + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + CHECK_FOR_INTERRUPTS(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + /* my turn, send a request */ + Assert(remote_setting->state == REMGUC_VACANT); + + remote_setting->state = REMGUC_REQUEST; + remote_setting->sourcepid = MyProcPid; + remote_setting->targetpid = pid; + remote_setting->userid = GetUserId(); + + strncpy(remote_setting->name, name, NAMEDATALEN); + remote_setting->name[NAMEDATALEN - 1] = 0; + strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + remote_setting->sender_latch = MyLatch; + + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", pid))); + } + + /* + * This request is processed only while the peer is idle, so it may take a + * long time before we get a response.
+ */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_DONE) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_POSTMASTER_DEATH, + -1, PG_WAIT_ACTIVITY); + + /* we don't care about the state in this case */ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* get out if we got a query cancel request */ + if (QueryCancelPending) + break; + } + + /* + * Cancel the request if possible. We cannot cancel it once the peer has + * processed it. We check the request status rather than + * QueryCancelPending so that this case is handled properly. + */ + if (remote_setting->state == REMGUC_REQUEST) + { + Assert(QueryCancelPending); + + remote_setting->state = REMGUC_CANCELING; + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, + PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", + pid))); + } + + /* Peer must respond shortly, don't sleep for a long time. */ + + cancel_start = GetCurrentTimestamp(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_CANCELED && + !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(), + GUC_REMOTE_CANCEL_TIMEOUT)) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY); + + /* we don't care about the state in this case
*/ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + if (remote_setting->state != REMGUC_CANCELED) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, (errmsg("failed cancelling remote GUC request"))); + } + + remote_setting->state = REMGUC_VACANT; + LWLockRelease(&remote_setting->lock); + + ereport(INFO, + (errmsg("remote GUC change request to PID %d is canceled", + pid))); + + return (Datum) BoolGetDatum(false); + } + + Assert (remote_setting->state == REMGUC_DONE); + + /* ereport exits on query cancel, we need this before that */ + remote_setting->state = REMGUC_VACANT; + + if (QueryCancelPending) + ereport(INFO, + (errmsg("remote GUC change request to PID %d already completed", + pid))); + + if (!remote_setting->success) + ereport(ERROR, + (errmsg("%s", remote_setting->value))); + + LWLockRelease(&remote_setting->lock); + + return (Datum) BoolGetDatum(true); +} + + +void +HandleRemoteGucSetInterrupt(void) +{ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* check if any request is being sent to me */ + if (remote_setting->targetpid == MyProcPid) + { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + InterruptPending = true; + RemoteGucChangePending = true; + break; + case REMGUC_CANCELING: + InterruptPending = true; + RemoteGucChangePending = true; + remote_setting->state = REMGUC_CANCELED; + SetLatch(remote_setting->sender_latch); + break; + default: + break; + } + } + LWLockRelease(&remote_setting->lock); +} + +void +HandleGucRemoteChanges(void) +{ + MemoryContext currentcxt = CurrentMemoryContext; + bool canceling = false; + bool process_request = true; + int saveInterruptHoldoffCount = 0; + int saveQueryCancelHoldoffCount = 0; + + RemoteGucChangePending = false; + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* skip if this request is no longer for me */ + if (remote_setting->targetpid != MyProcPid) + process_request = false; + else 
+ { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + remote_setting->state = REMGUC_INPROCESS; + break; + case REMGUC_CANCELING: + /* + * This request was already canceled, but we entered this + * function before receiving the signal. Cancel the request + * here. + */ + remote_setting->state = REMGUC_CANCELED; + remote_setting->success = false; + canceling = true; + break; + case REMGUC_VACANT: + case REMGUC_CANCELED: + case REMGUC_INPROCESS: + case REMGUC_DONE: + /* Just ignore these cases */ + process_request = false; + break; + } + } + + LWLockRelease(&remote_setting->lock); + + if (!process_request) + return; + + if (canceling) + { + SetLatch(remote_setting->sender_latch); + return; + } + + + /* Okay, actually modify the variable */ + remote_setting->success = true; + + PG_TRY(); + { + bool has_privilege; + bool is_superuser; + bool end_transaction = false; + /* + * XXXX: ERROR resets the following variables but we don't want that. + */ + saveInterruptHoldoffCount = InterruptHoldoffCount; + saveQueryCancelHoldoffCount = QueryCancelHoldoffCount; + + /* superuser_arg requires a transaction */ + if (!IsTransactionState()) + { + StartTransactionCommand(); + end_transaction = true; + } + is_superuser = superuser_arg(remote_setting->userid); + has_privilege = is_superuser || + has_privs_of_role(remote_setting->userid, GetUserId()); + + if (end_transaction) + CommitTransactionCommand(); + + if (!has_privilege) + elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d", + remote_setting->userid, MyProcPid); + + (void) set_config_option(remote_setting->name, remote_setting->value, + is_superuser ?
PGC_SUSET : PGC_USERSET, + PGC_S_SESSION, GUC_ACTION_NONXACT, + true, ERROR, false); + } + PG_CATCH(); + { + ErrorData *errdata; + MemoryContextSwitchTo(currentcxt); + errdata = CopyErrorData(); + remote_setting->success = false; + strncpy(remote_setting->value, errdata->message, + GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + FlushErrorState(); + + /* restore the saved value */ + InterruptHoldoffCount = saveInterruptHoldoffCount ; + QueryCancelHoldoffCount = saveQueryCancelHoldoffCount; + + } + PG_END_TRY(); + + ereport(LOG, + (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d", + remote_setting->name, remote_setting->value, + remote_setting->sourcepid))); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + remote_setting->state = REMGUC_DONE; + LWLockRelease(&remote_setting->lock); + + SetLatch(remote_setting->sender_latch); +} + /* * Validate a proposed option setting for GUCArrayAdd/Delete/Reset. 
* diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 86c84c7cf4..cf1c37aa9e 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5638,8 +5638,8 @@ proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' }, { oid => '2078', descr => 'SET X as a function', proname => 'set_config', proisstrict => 'f', provolatile => 'v', - proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool', - prosrc => 'set_config_by_name' }, + proparallel => 'u', prorettype => 'text', + proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' }, { oid => '2084', descr => 'SHOW ALL as a function', proname => 'pg_show_all_settings', prorows => '1000', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', @@ -9612,6 +9612,12 @@ proargmodes => '{i,o,o,o,o,o,o,o,o,o,o}', proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,nentries,last_update}', prosrc => 'pgstat_get_syscache_stats' }, +{ oid => '3424', + descr => 'set config of another backend', + proname => 'pg_set_backend_config', proisstrict => 'f', + proretset => 'f', provolatile => 'v', proparallel => 'u', + prorettype => 'bool', proargtypes => 'int4 text text', + prosrc => 'set_backend_config' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index b64bc499e4..4e341c93ed 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -832,7 +832,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_REMOTE_GUC } WaitEventIPC; /* ---------- diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index 6db0d69b71..4ad4927d3d 100644 
--- a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -42,6 +42,9 @@ typedef enum PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, + /* Remote GUC setting */ + PROCSIG_REMOTE_GUC, + NUM_PROCSIGNALS /* Must be last! */ } ProcSignalReason; diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h index f462eabe59..1766e64165 100644 --- a/src/include/utils/guc.h +++ b/src/include/utils/guc.h @@ -193,7 +193,8 @@ typedef enum /* Types of set_config_option actions */ GUC_ACTION_SET, /* regular SET command */ GUC_ACTION_LOCAL, /* SET LOCAL command */ - GUC_ACTION_SAVE /* function SET option, or temp assignment */ + GUC_ACTION_SAVE, /* function SET option, or temp assignment */ + GUC_ACTION_NONXACT /* transactional setting */ } GucAction; #define GUC_QUALIFIER_SEPARATOR '.' @@ -269,6 +270,8 @@ extern int tcp_keepalives_idle; extern int tcp_keepalives_interval; extern int tcp_keepalives_count; +extern volatile bool RemoteGucChangePending; + #ifdef TRACE_SORT extern bool trace_sort; #endif @@ -276,6 +279,11 @@ extern bool trace_sort; /* * Functions exported by guc.c */ +extern Size GucShmemSize(void); +extern void GucShmemInit(void); +extern Datum set_backend_setting(PG_FUNCTION_ARGS); +extern void HandleRemoteGucSetInterrupt(void); +extern void HandleGucRemoteChanges(void); extern void SetConfigOption(const char *name, const char *value, GucContext context, GucSource source); @@ -395,6 +403,9 @@ extern Size EstimateGUCStateSpace(void); extern void SerializeGUCState(Size maxsize, char *start_address); extern void RestoreGUCState(void *gucstate); +/* Remote GUC setting */ +extern void HandleGucRemoteChanges(void); + /* Support for messages reported from GUC check hooks */ extern PGDLLIMPORT char *GUC_check_errmsg_string; diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h index 668d9efd35..7a2396d2f5 100644 --- a/src/include/utils/guc_tables.h +++ b/src/include/utils/guc_tables.h @@ 
-113,7 +113,10 @@ typedef enum GUC_SAVE, /* entry caused by function SET option */ GUC_SET, /* entry caused by plain SET command */ GUC_LOCAL, /* entry caused by SET LOCAL command */ - GUC_SET_LOCAL /* entry caused by SET then SET LOCAL */ + GUC_NONXACT, /* entry caused by non-transactional ops */ + GUC_SET_LOCAL, /* entry caused by SET then SET LOCAL */ + GUC_NONXACT_SET, /* entry caused by NONXACT then SET */ + GUC_NONXACT_LOCAL /* entry caused by NONXACT then (SET)LOCAL */ } GucStackState; typedef struct guc_stack diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out index 43ac5f5f11..2c074705c7 100644 --- a/src/test/regress/expected/guc.out +++ b/src/test/regress/expected/guc.out @@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz; 2006-08-13 12:34:56-07 (1 row) +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 256kB + work_mem +---------- + 256kB +(1 row) + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + 
+-- SET, SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, 
true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 735dd37acf..3569edc22d 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1918,6 +1918,30 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.nentries, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.nentries, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,nentries, last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2349,7 +2373,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD 
NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql index 23e5029780..2fb23caafe 100644 --- a/src/test/regress/sql/guc.sql +++ b/src/test/regress/sql/guc.sql @@ -133,6 +133,94 @@ SHOW vacuum_cost_delay; SHOW datestyle; SELECT '2006-08-13 12:34:56'::timestamptz; +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +COMMIT; +SHOW work_mem; -- must see 256kB + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +COMMIT; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT 
set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +COMMIT; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +COMMIT; +SHOW work_mem; -- will see 256kB + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. -- 2.16.3
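The transactional behavior exercised by the regression tests above can be distilled into a toy model. This is illustrative only, in Python rather than C; it covers only top-level BEGIN/COMMIT/ROLLBACK (no savepoints), and all names here are made up, not part of the patch:

```python
# Toy model of the proposed GUC_ACTION_NONXACT semantics, distilled
# from the guc.sql test cases above (top-level transactions only).
class GucVar:
    def __init__(self, boot_value):
        self.base = boot_value     # value that survives transaction end
        self.current = boot_value  # value SHOW would report
        self.pending = None        # what a plain SET leaves after COMMIT

    def set(self, value, is_local=False, is_nonxact=False):
        self.current = value
        if is_nonxact:
            # NONXACT takes effect immediately and discards any pending
            # transactional state, so ROLLBACK cannot revert it.
            self.base = value
            self.pending = None
        elif not is_local:
            self.pending = value

    def commit(self):
        if self.pending is not None:
            self.base = self.pending
        self.current = self.base
        self.pending = None

    def rollback(self):
        self.current = self.base
        self.pending = None
```

Under this model, NONXACT followed by SET/SET LOCAL commits to the SET value but rolls back to the NONXACT value, and a NONXACT issued last wins either way, matching the first four test cases.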
Hello, thank you for updating the patch. >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] >At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi" ><ideriha.takeshi@jp.fujitsu.com> wrote in ><4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04> >> >As a *PoC*, in the attached patch (which applies to current master), >> >size of CTups are counted as the catcache size. >> > >> >It also provides pg_catcache_size system view just to give a rough >> >idea of how such view looks. I'll consider more on that but do you have any opinion >on this? >> > >... >> Great! I like this view. >> One of the extreme idea would be adding all the members printed by >> CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this >moment. >> All of the members seems too much for customers who tries to change >> the cache limit size But it may be some of the members are useful >> because for example cc_hits would indicate that current cache limit size is too small. > >The attached introduces four features below. (But the features on relcache and >plancache are omitted). I haven't looked into the code yet, but I'm going to do it later. Right now it seems to me that focusing on catalog cache invalidation and its stats is a quick route to committing this feature. >1. syscache stats collector (in 0002) > >Records syscache status consists of the same columns above and "ageclass" >information. We could somehow triggering a stats report with signal but we don't want >take/send/write the statistics in signal handler. Instead, it is turned on by setting >track_syscache_usage_interval to a positive number in milliseconds. Agreed. Ageclass is important for tweaking prune_min_age, and collecting stats at every stats change would be too heavy. >2. pg_stat_syscache view. (in 0002) > >This view shows catcache statistics. Statistics is taken only on the backends where >syscache tracking is active.
> >> pid | application_name | relname | cache_name >| size | ageclass | nentries >> >------+------------------+----------------+-----------------------------------+---------- >+-------------------------+--------------------------- >> 9984 | psql | pg_statistic | pg_statistic_relid_att_inh_index | >12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0} > >Age class is the basis of catcache truncation mechanism and shows the distribution >based on elapsed time since last access. As I didn't came up an appropriate way, it is >represented as two arrays. Ageclass stores maximum age for each class in seconds. >Nentries holds entry numbers correnponding to the same element in ageclass. In the >above example, > > age class : # of entries in the cache > up to 30s : 17660 > up to 60s : 17310 > up to 600s : 55870 > up to 1200s : 0 > up to 1800s : 0 > more longer : 0 > > The ageclass is {0, 0.05, 0.1, 1, 2, 3}th multiples of cache_prune_min_age on the >backend. I just thought that the pair of ageclass and nentries could be represented as JSON or as a multi-dimensional array, but in practice they are all the same and can be converted to each other with some functions, so I'm not sure which representation is better. >3. non-transactional GUC setting (in 0003) > >It allows setting GUC variable set by the action GUC_ACTION_NONXACT(the name >requires condieration) survive beyond rollback. It is required by remote guc setting to >work sanely. Without the feature a remote-set value within a trasction will disappear >involved in rollback. The only local interface for the NONXACT action is >set_config(name, value, is_local=false, is_nonxact = true). pg_set_backend_guc() >below works on this feature. TBH, I'm not familiar with this area and I may be missing something: in order to change another backend's GUC value, is ignoring transactional behavior always necessary?
When a transaction that sets a GUC fails and is rolled back, I thought that as long as the error is reported, simply retrying the transaction would be enough. >4. pg_set_backend_guc() function. > >Of course syscache statistics recording consumes significant amount of time so it >cannot be turned on usually. On the other hand since this feature is turned on by GUC, >it is needed to grab the active client connection to turn on/off the feature(but we >cannot). Instead, I provided a means to change GUC variables in another backend. > >pg_set_backend_guc(pid, name, value) sets the GUC variable "name" >on the backend "pid" to "value". > > > >With the above tools, we can inspect catcache statistics of seemingly bloated process. > >A. Find a bloated process pid using ps or something. > >B. Turn on syscache stats on the process. > =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000'); > >C. Examine the statitics. > >=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc >limit 3; > pid | relname | cache_name | size >------+--------------+----------------------------------+---------- > 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112 > 9984 | pg_cast | pg_cast_source_target_index | 4096 > 9984 | pg_operator | pg_operator_oprname_l_r_n_index | 4096 > > >=# select * from pg_stat_syscache where cache_name = >'pg_statistic_relid_att_inh_index'::regclass; >-[ RECORD 1 ]--------------------------------- >pid | 9984 >relname | pg_statistic >cache_name | pg_statistic_relid_att_inh_index >size | 11026176 >ntuples | 77950 >searches | 77950 >hits | 0 >neg_hits | 0 >ageclass | {30,60,600,1200,1800,0} >nentries | {17630,16950,43370,0,0,0} >last_update | 2018-10-17 15:58:19.738164+09 The output of this view seems good to me, and I can imagine this use case. Does the use case of setting the GUC locally never happen? I mean, can the setting also be changed locally? Regards, Takeshi Ideriha
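To make the ageclass/nentries pairing discussed above concrete, here is a rough sketch (Python, illustrative only; function and variable names are made up) of how entry ages would be bucketed using the {0.05, 0.1, 1, 2, 3} multiples of cache_prune_min_age:

```python
def ageclass_histogram(entry_ages, cache_prune_min_age=600):
    """Bucket catcache entry ages (seconds since last access) into the
    age classes described above.  The last class (shown as 0 in the
    view) collects everything older than the largest boundary."""
    multiples = [0.05, 0.1, 1.0, 2.0, 3.0]
    bounds = [m * cache_prune_min_age for m in multiples]
    nentries = [0] * (len(bounds) + 1)
    for age in entry_ages:
        for i, b in enumerate(bounds):
            if age <= b:
                nentries[i] += 1
                break
        else:
            nentries[-1] += 1          # older than every boundary
    return [int(b) for b in bounds] + [0], nentries
```

With the default cache_prune_min_age of 600 the boundaries come out as {30, 60, 600, 1200, 1800, 0}, matching the ageclass column in the view output; zipping the two returned lists together gives the multi-dimensional form discussed later in the thread.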
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >I haven't looked into the code but I'm going to do it later. Hi, I've taken a look at the 0001 patch; I'll review the rest later. if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Just to confirm: at first I thought catcacheclock was not updated while a parallel worker is active, but in that case it is updated by the parent, so no problem occurs. + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; + ct->size = tupsize; @@ -1906,17 +2051,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; tupsize is declared twice, inside and outside the if scope, but that doesn't seem necessary. Also, ct->size = tupsize is assigned twice, once in the if block and once after the if-else block. +static inline TimestampTz +GetCatCacheClock(void) This function is not called by anyone in this version of the patch. In the previous version it was called by plancache. Will further patches focus only on catcache? If so, this one can be removed. There are some typos. + int size; /* palloc'ed size off this tuple */ typo: off->of + /* Set this timestamp as aproximated current time */ typo: aproximated->approximated + * GUC variable to define the minimum size of hash to cosider entry eviction. typo: cosider -> consider + /* initilize catcache reference clock if haven't done yet */ typo: initilize -> initialize Regards, Takeshi Ideriha
Thank you for reviewing. At Thu, 15 Nov 2018 11:02:10 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F1F4165@G01JPEXMBKW04> > Hello, thank you for updating the patch. > > > >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > >At Thu, 4 Oct 2018 04:27:04 +0000, "Ideriha, Takeshi" > ><ideriha.takeshi@jp.fujitsu.com> wrote in > ><4E72940DA2BF16479384A86D54D0988A6F1BCB6F@G01JPEXMBKW04> > >> >As a *PoC*, in the attached patch (which applies to current master), > >> >size of CTups are counted as the catcache size. > >> > > >> >It also provides pg_catcache_size system view just to give a rough > >> >idea of how such view looks. I'll consider more on that but do you have any opinion > >on this? > >> > > >... > >> Great! I like this view. > >> One of the extreme idea would be adding all the members printed by > >> CatCachePrintStats(), which is only enabled with -DCATCACHE_STATS at this > >moment. > >> All of the members seems too much for customers who tries to change > >> the cache limit size But it may be some of the members are useful > >> because for example cc_hits would indicate that current cache limit size is too small. > > > >The attached introduces four features below. (But the features on relcache and > >plancache are omitted). > I haven't looked into the code but I'm going to do it later. > > Right now It seems to me that focusing on catalog cache invalidation and its stats a quick route > to commit this feature. > > >1. syscache stats collector (in 0002) > > > >Records syscache status consists of the same columns above and "ageclass" > >information. We could somehow triggering a stats report with signal but we don't want > >take/send/write the statistics in signal handler. Instead, it is turned on by setting > >track_syscache_usage_interval to a positive number in milliseconds. > > I agreed. Agecalss is important to tweak the prune_min_age. 
> Collecting stats is heavy at every stats change > > >2. pg_stat_syscache view. (in 0002) > > > >This view shows catcache statistics. Statistics is taken only on the backends where > >syscache tracking is active. > > > >> pid | application_name | relname | cache_name > >| size | ageclass | nentries > >> > >------+------------------+----------------+-----------------------------------+---------- > >+-------------------------+--------------------------- > >> 9984 | psql | pg_statistic | pg_statistic_relid_att_inh_index | > >12676096 | {30,60,600,1200,1800,0} | {17660,17310,55870,0,0,0} > > > >Age class is the basis of catcache truncation mechanism and shows the distribution > >based on elapsed time since last access. As I didn't came up an appropriate way, it is > >represented as two arrays. Ageclass stores maximum age for each class in seconds. > >Nentries holds entry numbers correnponding to the same element in ageclass. In the > >above example, > > > > age class : # of entries in the cache > > up to 30s : 17660 > > up to 60s : 17310 > > up to 600s : 55870 > > up to 1200s : 0 > > up to 1800s : 0 > > more longer : 0 > > > > The ageclass is {0, 0.05, 0.1, 1, 2, 3}th multiples of cache_prune_min_age on the > >backend. > > I just thought that the pair of ageclass and nentries can be represented as > json or multi-dimensional array but in virtual they are all same and can be converted each other > using some functions. So I'm not sure which representaion is better one. A multi-dimensional array in either style sounds reasonable. Maybe an array is preferable in system views, since it is a more basic type than JSON.
In the attached, it looks like the following: =# select * from pg_stat_syscache where ntuples > 100; -[ RECORD 1 ]-------------------------------------------------- pid | 1817 relname | pg_class cache_name | pg_class_oid_index size | 2048 ntuples | 189 searches | 1620 hits | 1431 neg_hits | 0 ageclass | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}} last_update | 2018-11-27 19:22:00.74026+09 > >3. non-transactional GUC setting (in 0003) > > > >It allows setting GUC variable set by the action GUC_ACTION_NONXACT(the name > >requires condieration) survive beyond rollback. It is required by remote guc setting to > >work sanely. Without the feature a remote-set value within a trasction will disappear > >involved in rollback. The only local interface for the NONXACT action is > >set_config(name, value, is_local=false, is_nonxact = true). pg_set_backend_guc() > >below works on this feature. > > TBH, I'm not familiar with around this and I may be missing something. > In order to change the other backend's GUC value, > is ignoring transactional behevior always necessary? When transaction of GUC setting > is failed and rollbacked, if the error message is supposeed to be reported I thought > just trying the transaction again is enough. The target backend can be running frequent transactions. The invoking backend cannot know whether the remote change happened during a transaction, nor whether that transaction was committed or aborted, and no error message is sent to the invoking backend. We could wait for the end of a transaction, but that doesn't work with long transactions. Maybe we don't need this feature in the GUC system itself, but adding another, similar mechanism doesn't seem reasonable, and this one would be useful for some other tracking features.
On the other hand since this feature is turned on by GUC, > >it is needed to grab the active client connection to turn on/off the feature(but we > >cannot). Instead, I provided a means to change GUC variables in another backend. > > > >pg_set_backend_guc(pid, name, value) sets the GUC variable "name" > >on the backend "pid" to "value". > > > > > > > >With the above tools, we can inspect catcache statistics of seemingly bloated process. > > > >A. Find a bloated process pid using ps or something. > > > >B. Turn on syscache stats on the process. > > =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval', '10000'); > > > >C. Examine the statitics. > > > >=# select pid, relname, cache_name, size from pg_stat_syscache order by size desc > >limit 3; > > pid | relname | cache_name | size > >------+--------------+----------------------------------+---------- > > 9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112 > > 9984 | pg_cast | pg_cast_source_target_index | 4096 > > 9984 | pg_operator | pg_operator_oprname_l_r_n_index | 4096 > > > > > >=# select * from pg_stat_syscache where cache_name = > >'pg_statistic_relid_att_inh_index'::regclass; > >-[ RECORD 1 ]--------------------------------- > >pid | 9984 > >relname | pg_statistic > >cache_name | pg_statistic_relid_att_inh_index > >size | 11026176 > >ntuples | 77950 > >searches | 77950 > >hits | 0 > >neg_hits | 0 > >ageclass | {30,60,600,1200,1800,0} > >nentries | {17630,16950,43370,0,0,0} > >last_update | 2018-10-17 15:58:19.738164+09 > > The output of this view seems good to me. > > I can imagine this use case. Does the use case of setting GUC locally never happen? > I mean can the setting be locally changed? Syscache grows through the life of a backend/session, and no other client can connect to it at the same time. So the variable must be set at the start of a backend using ALTER USER/DATABASE, or the client itself has to deliberately turn on the feature at a convenient time.
I suppose that in most use cases one wants to turn on this feature after seeing another session eat more and more memory. Attached is the rebased version with the multidimensional ageclass. Thank you for the comments in the next mail; sorry, I'll address them later. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 647334b5cb15926db460560c2e1cedbf33715a73 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons, and it is not desirable that they eat up memory. With this patch, removal of entries that haven't been used for a certain time is considered before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 166 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 254 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index db1a2d4e74..4f4654120e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>.
If you need to keep + a certain amount of syscache entries with intermittent usage, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum time in seconds for which a syscache entry must + remain unused before it is considered for removal. -1 disables syscache + pruning entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that have not been + used for this duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of the syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index d967400384..71ae0daf17 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,7 +733,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Use this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index b31fd5acea..09d5a9a520 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of the hash at which to consider + * entry eviction. This variable is shared among various cache mechanisms.
+ */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age, in seconds, of entries that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,6 +857,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -858,9 +875,129 @@ */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if we haven't done so yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can linger for several reasons. We remove them if they + * have not been accessed for a certain time to prevent the catcache from + * bloating. Eviction is driven by an access counter, using an algorithm + * similar to buffer eviction. Entries that have been accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element of nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple (in ageclass) of + * cache_prune_min_age.
The index of nremoved_entry is the value of the + * clock-sweep counter, which ranges from 0 to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries not accessed since the last pruning are removed after + * that many seconds, while entries that have been accessed + * several times are left alone for up to three times that + * duration before removal. We don't try to shrink the buckets + * since pruning effectively caps catcache expansion in the long + * term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1411,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1819,11 +1961,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1842,13 +1986,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2021,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. */ @@ -1898,17 +2043,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry.
If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 6497393c03..28af4c8795 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -80,6 +80,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2167,6 +2168,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("The cache is not pruned before it exceeds this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that stay unused for longer than this many seconds are considered for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index ee9ec6a120..6bc1fc3e61 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 7b22f9c7bc..ace4178619 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total size of catcache tuples in bytes */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as the access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 0b16d3cbfb6957e61c484fdb2794c49d69d78c9c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/3] Syscache usage tracking feature. Collects syscache usage statistics and shows them in the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 206 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 ++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 7 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + 17 files changed, 559 insertions(+), 44 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4f4654120e..b8a91d954d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6634,6 +6634,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv'
WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval to collect system cache usage statistics in + milliseconds. This parameter is 0 by default, which means disabled. + Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 715995dd88..4f7e12463e 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -903,6 +903,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1182,6 +1198,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git 
a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 8676088e57..fc50f10cbb 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #include "utils/tqual.h" @@ -125,6 +126,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -631,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -4286,6 +4311,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6376,3 +6404,163 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. 
+ */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval. + */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + return 0; + + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; report the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now write the file */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold + * interrupts to avoid recursive entry.
+ */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell caller to wait for the next interval. + */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out stats for every catcache */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} + +/* + * GUC assignment callback for track_syscache_usage_interval. + * + * Write a statistics file immediately when syscache statistics collection is + * turned on, and remove it as soon as it is turned off. + */ +void +pgstat_track_syscache_assign_hook(int newval, void *extra) +{ + if (newval > 0) + { + /* + * Immediately create a stats file. It's safe since we're not in the + * midst of accessing the syscache.
+ */ + pgstat_write_syscache_stats(true); + } + else + { + /* Turned off, immediately remove the statsfile */ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* ignore the result */ + } +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index a3b9757565..f2573fecbd 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3144,6 +3144,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3720,6 +3726,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. */ if (!IsUnderPostmaster) @@ -4160,9 +4167,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4205,6 +4222,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index f955f1912a..68e713f254 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + 
MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; + + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) 
stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 09d5a9a520..50288d444c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. */ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. 
*/ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is quite a + * time-consuming task to run during a catcache lookup, but acceptable + * since we would otherwise expand the hash table. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp) /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -983,14 +988,17 @@ CatCacheCleanupOldEntries(CatCache *cp) } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age class must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1367,9 +1375,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1429,9 +1435,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1440,9 +1444,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1570,9 +1572,7 @@ SearchCatCacheMiss(CatCache *cache, 
CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1683,9 +1683,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1742,9 +1740,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2252,3 +2248,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed on-session, fill it every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * nth element in nclass_entries stores the number of cache entries that + * have lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age. 
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the last access to the "current" + * time. Since catcacheclock is not advanced within a transaction, + * the entries that are accessed within the current transaction + * won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index c26808a833..c06ab2a798 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of the specified syscache + * + * This routine returns the address of its local static memory.
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index c6939779b9..5d2276b90c 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index b636b1e262..0f57e1a91f 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 28af4c8795..ba0e65f6fb 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3130,6 +3130,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 6bc1fc3e61..e36ab26bd7 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -552,6 +552,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 034a41eb55..4de9fdee44 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9616,6 +9616,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname =>
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index d6b32c070c..15f4d23f0c 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index f1c10d16b8..20add5052c 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1353,5 +1355,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); +extern void pgstat_track_syscache_assign_hook(int newval, void *extra); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ace4178619..721948b4cc 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 6f290c7214..c6b10850a9 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize 
(USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples that fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index dcc7307c16..e2a9c33f14 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3 From 50662c1d37e70c1b357ecebab261275e286ca49a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 21:31:22 +0900 Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config. This adds two features at once (they will be split later). One is a non-transactional GUC setting feature.
This allows a GUC variable set by the action GUC_ACTION_NONXACT (the name requires consideration) to survive beyond rollback. It is required for remote GUC setting to work sanely; without the feature, a value set remotely within a transaction would disappear on rollback. The only local interface for the NONXACT action is set_config(name, value, is_local=false, is_nonxact = true). The second is the remote GUC setting feature. It uses ProcSignal to notify the target server. --- doc/src/sgml/config.sgml | 4 + doc/src/sgml/func.sgml | 30 ++ src/backend/catalog/system_views.sql | 7 +- src/backend/postmaster/pgstat.c | 3 + src/backend/storage/ipc/ipci.c | 2 + src/backend/storage/ipc/procsignal.c | 4 + src/backend/tcop/postgres.c | 10 + src/backend/utils/misc/README | 26 +- src/backend/utils/misc/guc.c | 619 +++++++++++++++++++++++++++++++++-- src/include/catalog/pg_proc.dat | 10 +- src/include/pgstat.h | 3 +- src/include/storage/procsignal.h | 3 + src/include/utils/guc.h | 13 +- src/include/utils/guc_tables.h | 5 +- src/test/regress/expected/guc.out | 223 +++++++++++++ src/test/regress/expected/rules.out | 26 +- src/test/regress/sql/guc.sql | 88 +++++ 17 files changed, 1027 insertions(+), 49 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b8a91d954d..029642cddb 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter </listitem> </itemizedlist> + <para> + Values in other sessions can also be set using the SQL + function <function>pg_set_backend_config</function>.
+ </para> </sect2> <sect2> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 09c77db045..f3e4c8f592 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18694,6 +18694,20 @@ SELECT collation for ('foo' COLLATE "de_DE"); <entry><type>text</type></entry> <entry>set parameter and return new value</entry> </row> + <row> + <entry> + <indexterm> + <primary>pg_set_backend_config</primary> + </indexterm> + <literal><function>pg_set_backend_config( + <parameter>process_id</parameter>, + <parameter>setting_name</parameter>, + <parameter>new_value</parameter>) + </function></literal> + </entry> + <entry><type>bool</type></entry> + <entry>set parameter on another session</entry> + </row> </tbody> </tgroup> </table> @@ -18748,6 +18762,22 @@ SELECT set_config('log_statement_stats', 'off', false); ------------ off (1 row) +</programlisting> + </para> + + <para> + <function>pg_set_backend_config</function> sets the parameter + <parameter>setting_name</parameter> to + <parameter>new_value</parameter> in the session with PID + <parameter>process_id</parameter>. The setting is always session-local, + and the function returns true on success.
An example: +<programlisting> +SELECT pg_set_backend_config(2134, 'work_mem', '16MB'); + +pg_set_backend_config +------------ + t +(1 row) </programlisting> </para> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 4f7e12463e..642b7e28d4 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS CREATE RULE pg_settings_u AS ON UPDATE TO pg_settings WHERE new.name = old.name DO - SELECT set_config(old.name, new.setting, 'f'); + SELECT set_config(old.name, new.setting, 'f', 'f'); CREATE RULE pg_settings_n AS ON UPDATE TO pg_settings @@ -1048,6 +1048,11 @@ CREATE OR REPLACE FUNCTION RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote' PARALLEL SAFE; +CREATE OR REPLACE FUNCTION set_config ( + setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false) + RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name' + PARALLEL UNSAFE; + -- legacy definition for compatibility with 9.3 CREATE OR REPLACE FUNCTION json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false) diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index fc50f10cbb..a1e21f2696 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3707,6 +3707,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_REMOTE_GUC: + event_name = "RemoteGUC"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 0c86a581c0..03d526d12d 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); size = add_size(size, 
BackendRandomShmemSize()); + size = add_size(size, GucShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) SyncScanShmemInit(); AsyncShmemInit(); BackendRandomShmemInit(); + GucShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index b0dd7d1b37..b897c36bae 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -27,6 +27,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "utils/guc.h" /* @@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); + if (CheckProcSignal(PROCSIG_REMOTE_GUC)) + HandleRemoteGucSetInterrupt(); + SetLatch(MyLatch); latch_sigusr1_handler(); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f2573fecbd..a891935528 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3152,6 +3152,10 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + /* We don't want to change GUC variables while running a query */ + if (RemoteGucChangePending && DoingCommandRead) + HandleGucRemoteChanges(); } @@ -4188,6 +4192,12 @@ PostgresMain(int argc, char *argv[], send_ready_for_query = false; } + /* + * (2.5) Process some pending work. + */ + if (RemoteGucChangePending) + HandleGucRemoteChanges(); + /* * (2) Allow asynchronous signals to be executed immediately if they * come in while we are waiting for client input.
(This must be diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README index 6e294386f7..42ae6c1a8f 100644 --- a/src/backend/utils/misc/README +++ b/src/backend/utils/misc/README @@ -169,10 +169,14 @@ Entry to a function with a SET option: Plain SET command: If no stack entry of current level: - Push new stack entry w/prior value and state SET + Push new stack entry w/prior value and state SET or + push new stack entry w/o value and state NONXACT. else if stack entry's state is SAVE, SET, or LOCAL: change stack state to SET, don't change saved value (here we are forgetting effects of prior set action) + else if stack entry's state is NONXACT: + change stack state to NONXACT_SET, save the current value + into prior. else (entry must have state SET+LOCAL): discard its masked value, change state to SET (here we are forgetting effects of prior SET and SET LOCAL) @@ -185,13 +189,20 @@ SET LOCAL command: else if stack entry's state is SAVE or LOCAL or SET+LOCAL: no change to stack entry (in SAVE case, SET LOCAL will be forgotten at func exit) + else if stack entry's state is NONXACT: + save the current value into both the prior and masked slots; + set state NONXACT+LOCAL. else (entry must have state SET): put current active into its masked slot, set state SET+LOCAL Now set new value. +Setting by NONXACT action (no command exists): + Always blow away the existing stack, then create a new NONXACT entry. + Transaction or subtransaction abort: - Pop stack entries, restoring prior value, until top < subxact depth + Pop stack entries, restoring prior value unless the stack entry's + state is NONXACT, until top < subxact depth Transaction or subtransaction commit (incl.
successful function exit): if entry's state is SAVE: pop, restoring prior value - else if level is 1 and entry's state is SET+LOCAL: + else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL: pop, restoring *masked* value - else if level is 1 and entry's state is SET: + else if level is 1 and entry's state is SET or NONXACT+SET: pop, discarding old value else if level is 1 and entry's state is LOCAL: pop, restoring prior value @@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit): else merge entries of level N-1 and N as specified below -The merged entry will have level N-1 and prior = older prior, so easiest -to keep older entry and free newer. There are 12 possibilities since -we already handled level N state = SAVE: +The merged entry will have level N-1 and prior = older prior, so +easiest to keep older entry and free newer. Disregarding NONXACT, +there are 12 possibilities since we already handled level N state = SAVE: N-1 N @@ -232,6 +243,7 @@ SET+LOCAL SET discard top prior and second masked, state SET SET+LOCAL LOCAL discard top prior, no change to stack entry SET+LOCAL SET+LOCAL discard top prior, copy masked, state S+L +(TODO: states involving NONXACT) RESET is executed like a SET, but using the reset_val as the desired new value.
(We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index ba0e65f6fb..15c6e2889d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -216,6 +216,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel); +/* Enum and struct used to request a GUC change in another backend */ +typedef enum +{ + REMGUC_VACANT, + REMGUC_REQUEST, + REMGUC_INPROCESS, + REMGUC_DONE, + REMGUC_CANCELING, + REMGUC_CANCELED, +} remote_guc_status; + +#define GUC_REMOTE_MAX_VALUE_LEN 1024 /* an arbitrary value */ +#define GUC_REMOTE_CANCEL_TIMEOUT 5000 /* in milliseconds */ + +typedef struct +{ + remote_guc_status state; + char name[NAMEDATALEN]; + char value[GUC_REMOTE_MAX_VALUE_LEN]; + int sourcepid; + int targetpid; + Oid userid; + bool success; + volatile Latch *sender_latch; + LWLock lock; +} GucRemoteSetting; + +static GucRemoteSetting *remote_setting; + +volatile bool RemoteGucChangePending = false; + /* * Options for enum values defined in this module. * @@ -3137,7 +3168,7 @@ static struct config_int ConfigureNamesInt[] = }, &pgstat_track_syscache_usage_interval, 0, 0, INT_MAX / 2, - NULL, NULL, NULL + NULL, &pgstat_track_syscache_assign_hook, NULL }, { @@ -4695,7 +4726,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val) set_extra_field(gconf, &(val->extra), NULL); } - /* * Fetch the sorted array pointer (exported for help_config.c's use ONLY) */ @@ -5487,6 +5517,22 @@ push_old_value(struct config_generic *gconf, GucAction action) /* Do we already have a stack entry of the current nest level?
*/ stack = gconf->stack; + + /* A NONXACT action makes the existing stack useless */ + if (action == GUC_ACTION_NONXACT) + { + while (stack) + { + GucStack *prev = stack->prev; + + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + pfree(stack); + stack = prev; + } + stack = gconf->stack = NULL; + } + if (stack && stack->nest_level >= GUCNestLevel) { /* Yes, so adjust its state if necessary */ @@ -5494,28 +5540,63 @@ push_old_value(struct config_generic *gconf, GucAction action) switch (action) { case GUC_ACTION_SET: - /* SET overrides any prior action at same nest level */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_NONXACT) { - /* must discard old masked value */ - discard_stack_value(gconf, &stack->masked); + /* NONXACT rolls back to the current value */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->state = GUC_NONXACT_SET; } - stack->state = GUC_SET; + else + { + /* SET overrides other prior actions at same nest level */ + if (stack->state == GUC_SET_LOCAL) + { + /* must discard old masked value */ + discard_stack_value(gconf, &stack->masked); + } + stack->state = GUC_SET; + } + break; + case GUC_ACTION_LOCAL: if (stack->state == GUC_SET) { - /* SET followed by SET LOCAL, remember SET's value */ + /* SET followed by SET LOCAL, remember its value */ stack->masked_scontext = gconf->scontext; set_stack_value(gconf, &stack->masked); stack->state = GUC_SET_LOCAL; } + else if (stack->state == GUC_NONXACT) + { + /* + * NONXACT followed by SET LOCAL, both prior and masked + * are set to the current value + */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->masked_scontext = stack->scontext; + stack->masked = stack->prior; + stack->state = GUC_NONXACT_LOCAL; + } + else if (stack->state == GUC_NONXACT_SET) + { + /* NONXACT_SET followed by SET LOCAL, set masked */ + stack->masked_scontext = gconf->scontext; + set_stack_value(gconf,
&stack->masked); + stack->state = GUC_NONXACT_LOCAL; + } /* in all other cases, no change to stack entry */ break; case GUC_ACTION_SAVE: /* Could only have a prior SAVE of same variable */ Assert(stack->state == GUC_SAVE); break; + + case GUC_ACTION_NONXACT: + Assert(false); + break; } Assert(guc_dirty); /* must be set already */ return; @@ -5531,6 +5612,7 @@ push_old_value(struct config_generic *gconf, GucAction action) stack->prev = gconf->stack; stack->nest_level = GUCNestLevel; + switch (action) { case GUC_ACTION_SET: @@ -5542,10 +5624,15 @@ push_old_value(struct config_generic *gconf, GucAction action) case GUC_ACTION_SAVE: stack->state = GUC_SAVE; break; + case GUC_ACTION_NONXACT: + stack->state = GUC_NONXACT; + break; } stack->source = gconf->source; stack->scontext = gconf->scontext; - set_stack_value(gconf, &stack->prior); + + if (action != GUC_ACTION_NONXACT) + set_stack_value(gconf, &stack->prior); gconf->stack = stack; @@ -5640,22 +5727,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel) * stack entries to avoid leaking memory. If we do set one of * those flags, unused fields will be cleaned up after restoring. 
*/ - if (!isCommit) /* if abort, always restore prior value */ - restorePrior = true; + if (!isCommit) + { + /* GUC_NONXACT doesn't roll back */ + if (stack->state != GUC_NONXACT) + restorePrior = true; + } else if (stack->state == GUC_SAVE) restorePrior = true; else if (stack->nest_level == 1) { /* transaction commit */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_SET_LOCAL || + stack->state == GUC_NONXACT_LOCAL) restoreMasked = true; - else if (stack->state == GUC_SET) + else if (stack->state == GUC_SET || + stack->state == GUC_NONXACT_SET) { /* we keep the current active value */ discard_stack_value(gconf, &stack->prior); } - else /* must be GUC_LOCAL */ + else if (stack->state != GUC_NONXACT) + { + /* must be GUC_LOCAL */ restorePrior = true; + } } else if (prev == NULL || prev->nest_level < stack->nest_level - 1) @@ -5677,11 +5773,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET: - /* next level always becomes SET */ - discard_stack_value(gconf, &stack->prior); - if (prev->state == GUC_SET_LOCAL) + if (prev->state == GUC_SET || + prev->state == GUC_NONXACT_SET) + { + discard_stack_value(gconf, &stack->prior); + } + else if (prev->state == GUC_NONXACT) + { + prev->scontext = stack->scontext; + prev->prior = stack->prior; + prev->state = GUC_NONXACT_SET; + } + else if (prev->state == GUC_SET_LOCAL || + prev->state == GUC_NONXACT_LOCAL) + { + discard_stack_value(gconf, &stack->prior); discard_stack_value(gconf, &prev->masked); - prev->state = GUC_SET; + if (prev->state == GUC_SET_LOCAL) + prev->state = GUC_SET; + else + prev->state = GUC_NONXACT_SET; + } break; case GUC_LOCAL: @@ -5692,6 +5804,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel) prev->masked = stack->prior; prev->state = GUC_SET_LOCAL; } + else if (prev->state == GUC_NONXACT) + { + prev->prior = stack->masked; + prev->scontext = stack->masked_scontext; + prev->masked = stack->masked; + prev->masked_scontext = stack->masked_scontext; + discard_stack_value(gconf,
&stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } else { /* else just forget this stack level */ @@ -5700,15 +5822,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET_LOCAL: - /* prior state at this level no longer wanted */ - discard_stack_value(gconf, &stack->prior); - /* copy down the masked state */ - prev->masked_scontext = stack->masked_scontext; - if (prev->state == GUC_SET_LOCAL) - discard_stack_value(gconf, &prev->masked); - prev->masked = stack->masked; - prev->state = GUC_SET_LOCAL; + if (prev->state == GUC_NONXACT) + { + prev->prior = stack->prior; + prev->masked = stack->prior; + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } + else if (prev->state != GUC_NONXACT_SET) + { + /* prior state at this level no longer wanted */ + discard_stack_value(gconf, &stack->prior); + /* copy down the masked state */ + prev->masked_scontext = stack->masked_scontext; + if (prev->state == GUC_SET_LOCAL) + discard_stack_value(gconf, &prev->masked); + prev->masked = stack->masked; + prev->state = GUC_SET_LOCAL; + } break; + case GUC_NONXACT: + case GUC_NONXACT_SET: + case GUC_NONXACT_LOCAL: + Assert(false); + break; + } } @@ -7989,7 +8128,8 @@ set_config_by_name(PG_FUNCTION_ARGS) char *name; char *value; char *new_value; - bool is_local; + int set_action = GUC_ACTION_SET; + if (PG_ARGISNULL(0)) ereport(ERROR, @@ -8009,18 +8149,27 @@ set_config_by_name(PG_FUNCTION_ARGS) * Get the desired state of is_local. Default to false if provided value * is NULL */ - if (PG_ARGISNULL(2)) - is_local = false; - else - is_local = PG_GETARG_BOOL(2); + if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2)) + set_action = GUC_ACTION_LOCAL; + + /* + * Get the desired state of is_nonxact. 
Default to false if provided value + * is NULL + */ + if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3)) + { + if (set_action == GUC_ACTION_LOCAL) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only one of is_local and is_nonxact can be true"))); + set_action = GUC_ACTION_NONXACT; + } /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */ (void) set_config_option(name, value, (superuser() ? PGC_SUSET : PGC_USERSET), - PGC_S_SESSION, - is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET, - true, 0, false); + PGC_S_SESSION, set_action, true, 0, false); /* get the new current value */ new_value = GetConfigOptionByName(name, NULL, false); @@ -8029,7 +8178,6 @@ set_config_by_name(PG_FUNCTION_ARGS) PG_RETURN_TEXT_P(cstring_to_text(new_value)); } - /* * Common code for DefineCustomXXXVariable subroutines: allocate the * new variable's config struct and fill in generic fields. @@ -8228,6 +8376,13 @@ reapply_stacked_values(struct config_generic *variable, WARNING, false); break; + case GUC_NONXACT: + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_NONXACT, true, + WARNING, false); + break; + case GUC_LOCAL: (void) set_config_option(name, curvalue, curscontext, cursource, @@ -8247,6 +8402,33 @@ reapply_stacked_values(struct config_generic *variable, GUC_ACTION_LOCAL, true, WARNING, false); break; + + case GUC_NONXACT_SET: + /* first, apply the masked value as NONXACT */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current value as SET */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_SET, true, + WARNING, false); + break; + + case GUC_NONXACT_LOCAL: + /* first, apply the masked value as NONXACT */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current
value as LOCAL */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_LOCAL, true, + WARNING, false); + break; + + } /* If we successfully made a stack entry, adjust its nest level */ @@ -10225,6 +10407,373 @@ GUCArrayReset(ArrayType *array) return newarray; } +Size +GucShmemSize(void) +{ + Size size; + + size = sizeof(GucRemoteSetting); + + return size; +} + +void +GucShmemInit(void) +{ + Size size; + bool found; + + size = sizeof(GucRemoteSetting); + remote_setting = (GucRemoteSetting *) + ShmemInitStruct("GUC remote setting", size, &found); + + if (!found) + { + MemSet(remote_setting, 0, size); + LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId()); + } + + LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote"); +} + +/* + * set_backend_config: SQL-callable function to set a GUC variable in a + * remote session. + */ +Datum +set_backend_config(PG_FUNCTION_ARGS) +{ + int pid = PG_GETARG_INT32(0); + char *name = text_to_cstring(PG_GETARG_TEXT_P(1)); + char *value = text_to_cstring(PG_GETARG_TEXT_P(2)); + TimestampTz cancel_start; + PgBackendStatus *beentry; + int beid; + int rc; + + if (strlen(name) >= NAMEDATALEN) + ereport(ERROR, + (errcode(ERRCODE_NAME_TOO_LONG), + errmsg("name of GUC variable is too long"))); + if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("value is too long"), + errdetail("Maximum acceptable length of value is %d", + GUC_REMOTE_MAX_VALUE_LEN - 1))); + + /* find beentry for given pid */ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * This will be checked by SendProcSignal as well, but we do it here to + * emit an appropriate error message.
+ */ + if (!beentry) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("process PID %d not found", pid))); + + /* allow only client backends */ + if (beentry->st_backendType != B_BACKEND) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("not a client backend"))); + + /* + * Wait if someone is sending a request. We need to wait with timeout + * since the current user of the struct doesn't wake me up. + */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_VACANT) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 200, PG_WAIT_ACTIVITY); + + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + CHECK_FOR_INTERRUPTS(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + /* my turn, send a request */ + Assert(remote_setting->state == REMGUC_VACANT); + + remote_setting->state = REMGUC_REQUEST; + remote_setting->sourcepid = MyProcPid; + remote_setting->targetpid = pid; + remote_setting->userid = GetUserId(); + + strncpy(remote_setting->name, name, NAMEDATALEN); + remote_setting->name[NAMEDATALEN - 1] = 0; + strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + remote_setting->sender_latch = MyLatch; + + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", pid))); + } + + /* + * This request is processed only while idle time of peer so it may take a + * long time before we get a response. 
+ */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_DONE) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_POSTMASTER_DEATH, + -1, PG_WAIT_ACTIVITY); + + /* don't care of the state in the case.. */ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* get out if we got a query cancel request */ + if (QueryCancelPending) + break; + } + + /* + * Cancel the requset if possible. We cannot cancel the request in the + * case peer have processed it. We don't see QueryCancelPending but the + * request status so that the case is handled properly. + */ + if (remote_setting->state == REMGUC_REQUEST) + { + Assert(QueryCancelPending); + + remote_setting->state = REMGUC_CANCELING; + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, + PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", + pid))); + } + + /* Peer must respond shortly, don't sleep for a long time. */ + + cancel_start = GetCurrentTimestamp(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_CANCELED && + !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(), + GUC_REMOTE_CANCEL_TIMEOUT)) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY); + + /* don't care of the state in the case.. 
*/ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + if (remote_setting->state != REMGUC_CANCELED) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, (errmsg("failed cancelling remote GUC request"))); + } + + remote_setting->state = REMGUC_VACANT; + LWLockRelease(&remote_setting->lock); + + ereport(INFO, + (errmsg("remote GUC change request to PID %d is canceled", + pid))); + + return (Datum) BoolGetDatum(false); + } + + Assert (remote_setting->state == REMGUC_DONE); + + /* ereport exits on query cancel, we need this before that */ + remote_setting->state = REMGUC_VACANT; + + if (QueryCancelPending) + ereport(INFO, + (errmsg("remote GUC change request to PID %d already completed", + pid))); + + if (!remote_setting->success) + ereport(ERROR, + (errmsg("%s", remote_setting->value))); + + LWLockRelease(&remote_setting->lock); + + return (Datum) BoolGetDatum(true); +} + + +void +HandleRemoteGucSetInterrupt(void) +{ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* check if any request is being sent to me */ + if (remote_setting->targetpid == MyProcPid) + { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + InterruptPending = true; + RemoteGucChangePending = true; + break; + case REMGUC_CANCELING: + InterruptPending = true; + RemoteGucChangePending = true; + remote_setting->state = REMGUC_CANCELED; + SetLatch(remote_setting->sender_latch); + break; + default: + break; + } + } + LWLockRelease(&remote_setting->lock); +} + +void +HandleGucRemoteChanges(void) +{ + MemoryContext currentcxt = CurrentMemoryContext; + bool canceling = false; + bool process_request = true; + int saveInterruptHoldoffCount = 0; + int saveQueryCancelHoldoffCount = 0; + + RemoteGucChangePending = false; + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* skip if this request is no longer for me */ + if (remote_setting->targetpid != MyProcPid) + process_request = false; + else 
+ { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + remote_setting->state = REMGUC_INPROCESS; + break; + case REMGUC_CANCELING: + /* + * This request is already canceled but entered this function + * before receiving signal. Cancel the request here. + */ + remote_setting->state = REMGUC_CANCELED; + remote_setting->success = false; + canceling = true; + break; + case REMGUC_VACANT: + case REMGUC_CANCELED: + case REMGUC_INPROCESS: + case REMGUC_DONE: + /* Just ignore the cases */ + process_request = false; + break; + } + } + + LWLockRelease(&remote_setting->lock); + + if (!process_request) + return; + + if (canceling) + { + SetLatch(remote_setting->sender_latch); + return; + } + + + /* Okay, actually modify variable */ + remote_setting->success = true; + + PG_TRY(); + { + bool has_privilege; + bool is_superuser; + bool end_transaction = false; + /* + * XXXX: ERROR resets the following varialbes but we don't want that. + */ + saveInterruptHoldoffCount = InterruptHoldoffCount; + saveQueryCancelHoldoffCount = QueryCancelHoldoffCount; + + /* superuser_arg requires a transaction */ + if (!IsTransactionState()) + { + StartTransactionCommand(); + end_transaction = true; + } + is_superuser = superuser_arg(remote_setting->userid); + has_privilege = is_superuser || + has_privs_of_role(remote_setting->userid, GetUserId()); + + if (end_transaction) + CommitTransactionCommand(); + + if (!has_privilege) + elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d", + remote_setting->userid, MyProcPid); + + (void) set_config_option(remote_setting->name, remote_setting->value, + is_superuser ? 
PGC_SUSET : PGC_USERSET, + PGC_S_SESSION, GUC_ACTION_NONXACT, + true, ERROR, false); + } + PG_CATCH(); + { + ErrorData *errdata; + MemoryContextSwitchTo(currentcxt); + errdata = CopyErrorData(); + remote_setting->success = false; + strncpy(remote_setting->value, errdata->message, + GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + FlushErrorState(); + + /* restore the saved value */ + InterruptHoldoffCount = saveInterruptHoldoffCount ; + QueryCancelHoldoffCount = saveQueryCancelHoldoffCount; + + } + PG_END_TRY(); + + ereport(LOG, + (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d", + remote_setting->name, remote_setting->value, + remote_setting->sourcepid))); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + remote_setting->state = REMGUC_DONE; + LWLockRelease(&remote_setting->lock); + + SetLatch(remote_setting->sender_latch); +} + /* * Validate a proposed option setting for GUCArrayAdd/Delete/Reset. 
* diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 4de9fdee44..62a64db022 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5647,8 +5647,8 @@ proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' }, { oid => '2078', descr => 'SET X as a function', proname => 'set_config', proisstrict => 'f', provolatile => 'v', - proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool', - prosrc => 'set_config_by_name' }, + proparallel => 'u', prorettype => 'text', + proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' }, { oid => '2084', descr => 'SHOW ALL as a function', proname => 'pg_show_all_settings', prorows => '1000', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', @@ -9625,6 +9625,12 @@ proargmodes => '{i,o,o,o,o,o,o,o,o,o}', proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', prosrc => 'pgstat_get_syscache_stats' }, +{ oid => '3424', + descr => 'set config of another backend', + proname => 'pg_set_backend_config', proisstrict => 'f', + proretset => 'f', provolatile => 'v', proparallel => 'u', + prorettype => 'bool', proargtypes => 'int4 text text', + prosrc => 'set_backend_config' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 20add5052c..198fa42f80 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -833,7 +833,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_REMOTE_GUC } WaitEventIPC; /* ---------- diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index 6db0d69b71..4ad4927d3d 100644 --- 
a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -42,6 +42,9 @@ typedef enum PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, + /* Remote GUC setting */ + PROCSIG_REMOTE_GUC, + NUM_PROCSIGNALS /* Must be last! */ } ProcSignalReason; diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h index df2e556b02..0f3498fc6d 100644 --- a/src/include/utils/guc.h +++ b/src/include/utils/guc.h @@ -193,7 +193,8 @@ typedef enum /* Types of set_config_option actions */ GUC_ACTION_SET, /* regular SET command */ GUC_ACTION_LOCAL, /* SET LOCAL command */ - GUC_ACTION_SAVE /* function SET option, or temp assignment */ + GUC_ACTION_SAVE, /* function SET option, or temp assignment */ + GUC_ACTION_NONXACT /* transactional setting */ } GucAction; #define GUC_QUALIFIER_SEPARATOR '.' @@ -268,6 +269,8 @@ extern int tcp_keepalives_idle; extern int tcp_keepalives_interval; extern int tcp_keepalives_count; +extern volatile bool RemoteGucChangePending; + #ifdef TRACE_SORT extern bool trace_sort; #endif @@ -275,6 +278,11 @@ extern bool trace_sort; /* * Functions exported by guc.c */ +extern Size GucShmemSize(void); +extern void GucShmemInit(void); +extern Datum set_backend_setting(PG_FUNCTION_ARGS); +extern void HandleRemoteGucSetInterrupt(void); +extern void HandleGucRemoteChanges(void); extern void SetConfigOption(const char *name, const char *value, GucContext context, GucSource source); @@ -394,6 +402,9 @@ extern Size EstimateGUCStateSpace(void); extern void SerializeGUCState(Size maxsize, char *start_address); extern void RestoreGUCState(void *gucstate); +/* Remote GUC setting */ +extern void HandleGucRemoteChanges(void); + /* Support for messages reported from GUC check hooks */ extern PGDLLIMPORT char *GUC_check_errmsg_string; diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h index 6f9fdb6a5f..4980a01c97 100644 --- a/src/include/utils/guc_tables.h +++ b/src/include/utils/guc_tables.h @@ -115,7 
+115,10 @@ typedef enum GUC_SAVE, /* entry caused by function SET option */ GUC_SET, /* entry caused by plain SET command */ GUC_LOCAL, /* entry caused by SET LOCAL command */ - GUC_SET_LOCAL /* entry caused by SET then SET LOCAL */ + GUC_NONXACT, /* entry caused by non-transactional ops */ + GUC_SET_LOCAL, /* entry caused by SET then SET LOCAL */ + GUC_NONXACT_SET, /* entry caused by NONXACT then SET */ + GUC_NONXACT_LOCAL /* entry caused by NONXACT then (SET)LOCAL */ } GucStackState; typedef struct guc_stack diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out index 43ac5f5f11..2c074705c7 100644 --- a/src/test/regress/expected/guc.out +++ b/src/test/regress/expected/guc.out @@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz; 2006-08-13 12:34:56-07 (1 row) +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 256kB + work_mem +---------- + 256kB +(1 row) + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- SET, 
SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- 
NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 735dd37acf..3569edc22d 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1918,6 +1918,30 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.nentries, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.nentries, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,nentries, last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2349,7 +2373,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; 
pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql index 23e5029780..2fb23caafe 100644 --- a/src/test/regress/sql/guc.sql +++ b/src/test/regress/sql/guc.sql @@ -133,6 +133,94 @@ SHOW vacuum_cost_delay; SHOW datestyle; SELECT '2006-08-13 12:34:56'::timestamptz; +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +COMMIT; +SHOW work_mem; -- must see 256kB + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +COMMIT; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', 
'256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +COMMIT; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +COMMIT; +SHOW work_mem; -- will see 256kB + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. -- 2.16.3
> On Tue, Nov 27, 2018 at 11:40 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > The attached is the rebased version that has multidimensional
> > ageclass.

Thank you. Just for your information, cfbot complains about this patch because:

pgstatfuncs.c: In function ‘pgstat_get_syscache_stats’:
pgstatfuncs.c:1973:8: error: ignoring return value of ‘fread’, declared with attribute warn_unused_result [-Werror=unused-result]
  fread(&cacheid, sizeof(int), 1, fpin);
  ^
pgstatfuncs.c:1974:8: error: ignoring return value of ‘fread’, declared with attribute warn_unused_result [-Werror=unused-result]
  fread(&last_update, sizeof(TimestampTz), 1, fpin);
  ^

I'm moving it to the next CF as "Waiting on author", since, as far as I understand, you want to address more comments from the reviewer.
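For reference, the kind of fix cfbot's warning calls for is simply to test fread()'s return value instead of discarding it. A minimal, self-contained sketch (read_exact is a hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: treat a short fread() as an error instead of
 * silently ignoring it, which is what the warn_unused_result warning
 * is asking for.  Returns 1 on a complete read, 0 otherwise; the
 * caller can then report a corrupted stats file. */
static int
read_exact(void *buf, size_t size, size_t nitems, FILE *fp)
{
    return fread(buf, size, nitems, fp) == nitems;
}
```

In the patch the two flagged fread calls would be wrapped this way and a failure turned into an ereport/elog rather than being ignored.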
Hello,

Sorry for the delay. Detailed comments on the source code will follow later.

>> I just thought that the pair of ageclass and nentries could be
>> represented as json or as a multi-dimensional array, but in effect they
>> are all the same and can be converted to each other with some functions,
>> so I'm not sure which representation is the better one.
>
> A multi-dimensional array in any style sounds reasonable. Maybe an array
> is preferable in system views, as it is a more basic type than JSON. In
> the attached, it looks like the following:
>
> =# select * from pg_stat_syscache where ntuples > 100;
> -[ RECORD 1 ]--------------------------------------------------
> pid         | 1817
> relname     | pg_class
> cache_name  | pg_class_oid_index
> size        | 2048
> ntuples     | 189
> searches    | 1620
> hits        | 1431
> neg_hits    | 0
> ageclass    | {{30,189},{60,0},{600,0},{1200,0},{1800,0},{0,0}}
> last_update | 2018-11-27 19:22:00.74026+09

Thanks, cool. That seems better to me.

>> > 3. non-transactional GUC setting (in 0003)
>> >
>> > It allows a GUC variable set by the action GUC_ACTION_NONXACT (the
>> > name needs consideration) to survive beyond a rollback. It is
>> > required for remote GUC setting to work sanely; without it, a
>> > remotely set value would disappear in any rollback it gets involved
>> > in. The only local interface for the NONXACT action is
>> > set_config(name, value, is_local=false, is_nonxact=true).
>> > pg_set_backend_guc() below works on top of this feature.
>>
>> TBH, I'm not familiar with this area and I may be missing something.
>> In order to change another backend's GUC value, is ignoring
>> transactional behavior always necessary? If the transaction doing the
>> GUC setting fails and is rolled back, and the error message is
>> reported, I thought just retrying the transaction would be enough.
>
> The target backend can be running frequent transactions. The invoking
> backend cannot know whether the remote change happened during a
> transaction, nor whether that transaction was committed or aborted; no
> error message is sent back to the invoking backend. We could wait for
> the end of a transaction, but that doesn't work with long transactions.
>
> Maybe we don't need this feature in the GUC system, but adding another,
> similar feature doesn't seem reasonable either. This would be useful
> for some other tracking features.

Thank you for the clarification.

>> > 4. pg_set_backend_guc() function.
>> >
>> > Of course, syscache statistics recording consumes a significant
>> > amount of time, so it cannot usually be left turned on. On the
>> > other hand, since this feature is controlled by a GUC, one would
>> > need to grab the active client connection to turn the feature on or
>> > off (but we cannot). Instead, I provided a means to change GUC
>> > variables in another backend.
>> >
>> > pg_set_backend_guc(pid, name, value) sets the GUC variable "name"
>> > on the backend "pid" to "value".
>> >
>> > With the above tools, we can inspect catcache statistics of a
>> > seemingly bloated process:
>> >
>> > A. Find the bloated process's pid using ps or something similar.
>> >
>> > B. Turn on syscache stats on that process.
>> >    =# select pg_set_backend_guc(9984, 'track_syscache_usage_interval',
>> >       '10000');
>> >
>> > C. Examine the statistics.
>> > =# select pid, relname, cache_name, size from pg_stat_syscache
>> >    order by size desc limit 3;
>> >  pid  |   relname    |            cache_name            |   size
>> > ------+--------------+----------------------------------+----------
>> >  9984 | pg_statistic | pg_statistic_relid_att_inh_index | 32154112
>> >  9984 | pg_cast      | pg_cast_source_target_index      |     4096
>> >  9984 | pg_operator  | pg_operator_oprname_l_r_n_index  |     4096
>> >
>> > =# select * from pg_stat_syscache where cache_name =
>> >    'pg_statistic_relid_att_inh_index'::regclass;
>> > -[ RECORD 1 ]---------------------------------
>> > pid         | 9984
>> > relname     | pg_statistic
>> > cache_name  | pg_statistic_relid_att_inh_index
>> > size        | 11026176
>> > ntuples     | 77950
>> > searches    | 77950
>> > hits        | 0
>> > neg_hits    | 0
>> > ageclass    | {30,60,600,1200,1800,0}
>> > nentries    | {17630,16950,43370,0,0,0}
>> > last_update | 2018-10-17 15:58:19.738164+09
>>
>> The output of this view seems good to me.
>>
>> I can imagine this use case. Does the use case of setting the GUC
>> locally never happen? I mean, can the setting be changed locally?
>
> A syscache grows through the life of a backend/session, and no other
> client can connect to that session at the same time. So the variable
> must be set at the start of a backend using ALTER USER/DATABASE, or the
> client itself is obliged to deliberately turn on the feature at a
> convenient time. I suppose that in most use cases one wants to turn on
> this feature after seeing that another session is eating more and more
> memory.
>
> The attached is the rebased version that has the multidimensional
> ageclass.

Thank you! That's convenient.

How about splitting the non-transactional GUC and remote GUC setting
features out as a separate commitfest entry? I'm planning to review the
0001 and 0002 patches in more detail, hopefully turn them to 'ready for
committer', and review the remote GUC feature later.

Related to splitting the features: why have you discarded the pruning of
the relcache and plancache? Personally I want the relcache one as well as
the catcache one, because with respect to memory bloat there is some
correlation between them.

Regards,
Takeshi Ideriha
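For readers following along, the ageclass output quoted above pairs each class's upper bound in seconds with an entry count. A minimal sketch of that bucketing, with the boundaries taken from the sample output (illustrative only, not the patch's actual code):

```c
#include <assert.h>

/* Illustrative ageclass bucketing, modeled on the sample view output:
 * an entry is counted into the first class whose upper bound (seconds
 * since last access) it does not exceed; the final 0 slot catches
 * everything older than the last bound. */
#define NAGECLASSES 6

static const int ageclass_bound[NAGECLASSES] = {30, 60, 600, 1200, 1800, 0};

static void
count_ageclass(const int *ages, int nages, int counts[NAGECLASSES])
{
    for (int c = 0; c < NAGECLASSES; c++)
        counts[c] = 0;

    for (int i = 0; i < nages; i++)
    {
        int c;

        for (c = 0; c < NAGECLASSES - 1; c++)
            if (ages[i] <= ageclass_bound[c])
                break;
        counts[c]++;            /* last slot is the catch-all */
    }
}
```

A session whose entries all fall in the first bucket would then report something like {{30,189},{60,0},...}, as in the RECORD shown earlier.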
> From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> The detailed comments for the source code will be provided later.

Hi, I'm adding some comments on the 0001 and 0002 patches.

[0001 patch]

+    /*
+     * Calculate the duration from the time of the last access to the
+     * "current" time. Since catcacheclock is not advanced within a
+     * transaction, the entries that are accessed within the current
+     * transaction won't be pruned.
+     */
+    TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);

+    /*
+     * Try to remove entries older than cache_prune_min_age seconds.
+     */
+    if (entry_age > cache_prune_min_age)

Can you change this comparison between entry_age and cache_prune_min_age
to "entry_age >= cache_prune_min_age"? That is, I want entries that are
accessed even within the current transaction to be pruned when
cache_prune_min_age = 0. I can think of some of my customers who want to
keep memory usage below a certain limit as strictly as possible. This
kind of strict user would set cache_prune_min_age to 0 and would not
want to exceed the memory target even within a transaction.

I put miscellaneous comments about the 0001 patch in a previous email,
so please take a look at those as well.

[0002 patch]

I haven't looked into every detail, but here are some comments.

Maybe you also need to add some sentences to this page:
https://www.postgresql.org/docs/current/monitoring-stats.html

+pgstat_get_syscache_stats(PG_FUNCTION_ARGS)

A function name like 'pg_stat_XXX' would match the surrounding code.

When applying the patch I found trailing whitespace warnings:

../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:157: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:256: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:301: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:483: trailing whitespace.
../patch/horiguchi_cache/v6_stats/0002-Syscache-usage-tracking-feature.patch:539: trailing whitespace.

Regards,
Takeshi Ideriha
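The ">" versus ">=" point above is easiest to see at cache_prune_min_age = 0: with ">", an entry whose age is exactly 0 (i.e. one accessed under the frozen catcacheclock of the current transaction) never qualifies for pruning; with ">=", it does. A tiny illustrative model of the two predicates (not the patch's code):

```c
#include <assert.h>

/* Illustrative model of the pruning predicate under discussion.
 * With ">", an entry whose age equals the threshold survives; with
 * ">=", cache_prune_min_age = 0 can prune even entries accessed in
 * the current transaction (age 0 under a frozen catcacheclock). */
static int
prunable_gt(long entry_age, long cache_prune_min_age)
{
    return entry_age > cache_prune_min_age;
}

static int
prunable_ge(long entry_age, long cache_prune_min_age)
{
    return entry_age >= cache_prune_min_age;
}
```

Only the boundary case differs; for any positive threshold the two behave identically except for entries whose age equals the threshold exactly.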
I'm really disappointed by the direction this thread is going in. The latest patches add an enormous amount of mechanism, and user-visible complexity, to do something that we learned was a bad idea decades ago. Putting a limit on the size of the syscaches doesn't accomplish anything except to add cycles if your cache working set is below the limit, or make performance fall off a cliff if it's above the limit. I don't think there's any reason to believe that making it more complicated will avoid that problem.

What does seem promising is something similar to Horiguchi-san's original patches all the way back at

https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp

That is, identify usage patterns in which we tend to fill the caches with provably no-longer-useful entries, and improve those particular cases. Horiguchi-san identified one such case in that message: negative entries in the STATRELATTINH cache, caused by the planner probing for stats that aren't there, and then not cleared when the relevant table gets dropped (since, by definition, they don't match any pg_statistic entry that gets deleted). We saw another recent report of the same problem at

https://www.postgresql.org/message-id/flat/2114009259.1866365.1544469996900%40mail.yahoo.com

so I'd been thinking about ways to fix that case in particular.

I came up with a fix that I think is simpler and a bit more efficient than what Horiguchi-san proposed originally: rather than trying to reverse-engineer what to do in low-level cache callbacks, let's have the catalog manipulation code explicitly send out invalidation commands when the relevant situations arise. In the attached, heap.c's RemoveStatistics sends out an sinval message commanding deletion of negative STATRELATTINH entries that match the OID of the table being deleted.
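The representation change this approach needs — sinval messages that carry both a message-type tag and a cache id — can be pictured with a stripped-down, hypothetical pair of structs (field names echo the patch, but these are not PostgreSQL's real definitions): the extra int8 tag plus an int8 cache id fit ahead of the 4-byte members, in space that was alignment padding anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of why a separate message-type tag is free:
 * the int8 tag plus an int8 cache id both sit before the 4-byte
 * members, occupying bytes that were alignment padding before. */
typedef struct
{
    int8_t      id;         /* message type tag */
    int8_t      cacheId;    /* catcache id, stored in former padding */
    uint32_t    dbId;
    uint32_t    hashValue;
} CatcacheMsg;

typedef struct
{
    int8_t      id;         /* message type tag only */
    uint32_t    dbId;
    uint32_t    relId;
} RelcacheMsg;
```

On typical platforms both structs occupy the same number of bytes, which is the sense in which the refactoring "isn't costly".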
We could use the same infrastructure to clean out dead RELNAMENSP entries after a schema deletion, as per Horiguchi-san's second original suggestion; although I haven't done so here because I'm not really convinced that that's got an attractive cost-benefit ratio. (In both my patch and Horiguchi-san's, we have to traverse all entries in the affected cache, so sending out one of these messages is potentially not real cheap.) To do this we need to adjust the representation of sinval messages so that we can have two different kinds of messages that include a cache ID. Fortunately, because there's padding space available, that's not costly. 0001 below is a simple refactoring patch that converts the message type ID into a plain enum field that's separate from the cache ID if any. (I'm inclined to apply this whether or not people like 0002: it makes the code clearer, more maintainable, and probably a shade faster thanks to replacing an if-then-else chain with a switch.) Then 0002 adds the feature of an sinval message type saying "delete negative entries in cache X that have OID Y in key column Z", and teaches RemoveStatistics to use that. Thoughts? regards, tom lane diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c index c295358..a7f367c 100644 *** a/src/backend/access/rmgrdesc/standbydesc.c --- b/src/backend/access/rmgrdesc/standbydesc.c *************** standby_desc_invalidations(StringInfo bu *** 111,131 **** { SharedInvalidationMessage *msg = &msgs[i]; ! if (msg->id >= 0) ! appendStringInfo(buf, " catcache %d", msg->id); ! else if (msg->id == SHAREDINVALCATALOG_ID) ! appendStringInfo(buf, " catalog %u", msg->cat.catId); ! else if (msg->id == SHAREDINVALRELCACHE_ID) ! appendStringInfo(buf, " relcache %u", msg->rc.relId); ! /* not expected, but print something anyway */ ! else if (msg->id == SHAREDINVALSMGR_ID) ! appendStringInfoString(buf, " smgr"); ! /* not expected, but print something anyway */ ! 
else if (msg->id == SHAREDINVALRELMAP_ID) ! appendStringInfo(buf, " relmap db %u", msg->rm.dbId); ! else if (msg->id == SHAREDINVALSNAPSHOT_ID) ! appendStringInfo(buf, " snapshot %u", msg->sn.relId); ! else ! appendStringInfo(buf, " unrecognized id %d", msg->id); } } --- 111,141 ---- { SharedInvalidationMessage *msg = &msgs[i]; ! switch ((SharedInvalMsgType) msg->id) ! { ! case SharedInvalCatcache: ! appendStringInfo(buf, " catcache %d", msg->cc.cacheId); ! break; ! case SharedInvalCatalog: ! appendStringInfo(buf, " catalog %u", msg->cat.catId); ! break; ! case SharedInvalRelcache: ! appendStringInfo(buf, " relcache %u", msg->rc.relId); ! break; ! case SharedInvalSmgr: ! /* not expected, but print something anyway */ ! appendStringInfoString(buf, " smgr"); ! break; ! case SharedInvalRelmap: ! /* not expected, but print something anyway */ ! appendStringInfo(buf, " relmap db %u", msg->rm.dbId); ! break; ! case SharedInvalSnapshot: ! appendStringInfo(buf, " snapshot %u", msg->sn.relId); ! break; ! default: ! appendStringInfo(buf, " unrecognized id %d", msg->id); ! break; ! } } } diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c index 80d7a76..5bc08b0 100644 *** a/src/backend/utils/cache/inval.c --- b/src/backend/utils/cache/inval.c *************** AddCatcacheInvalidationMessage(Invalidat *** 340,346 **** SharedInvalidationMessage msg; Assert(id < CHAR_MAX); ! msg.cc.id = (int8) id; msg.cc.dbId = dbId; msg.cc.hashValue = hashValue; --- 340,347 ---- SharedInvalidationMessage msg; Assert(id < CHAR_MAX); ! msg.cc.id = SharedInvalCatcache; ! msg.cc.cacheId = (int8) id; msg.cc.dbId = dbId; msg.cc.hashValue = hashValue; *************** AddCatalogInvalidationMessage(Invalidati *** 367,373 **** { SharedInvalidationMessage msg; ! msg.cat.id = SHAREDINVALCATALOG_ID; msg.cat.dbId = dbId; msg.cat.catId = catId; /* check AddCatcacheInvalidationMessage() for an explanation */ --- 368,374 ---- { SharedInvalidationMessage msg; ! 
msg.cat.id = SharedInvalCatalog; msg.cat.dbId = dbId; msg.cat.catId = catId; /* check AddCatcacheInvalidationMessage() for an explanation */ *************** AddRelcacheInvalidationMessage(Invalidat *** 391,403 **** * don't need to add individual ones when it is present. */ ProcessMessageList(hdr->rclist, ! if (msg->rc.id == SHAREDINVALRELCACHE_ID && (msg->rc.relId == relId || msg->rc.relId == InvalidOid)) return); /* OK, add the item */ ! msg.rc.id = SHAREDINVALRELCACHE_ID; msg.rc.dbId = dbId; msg.rc.relId = relId; /* check AddCatcacheInvalidationMessage() for an explanation */ --- 392,404 ---- * don't need to add individual ones when it is present. */ ProcessMessageList(hdr->rclist, ! if (msg->rc.id == SharedInvalRelcache && (msg->rc.relId == relId || msg->rc.relId == InvalidOid)) return); /* OK, add the item */ ! msg.rc.id = SharedInvalRelcache; msg.rc.dbId = dbId; msg.rc.relId = relId; /* check AddCatcacheInvalidationMessage() for an explanation */ *************** AddSnapshotInvalidationMessage(Invalidat *** 418,429 **** /* Don't add a duplicate item */ /* We assume dbId need not be checked because it will never change */ ProcessMessageList(hdr->rclist, ! if (msg->sn.id == SHAREDINVALSNAPSHOT_ID && msg->sn.relId == relId) return); /* OK, add the item */ ! msg.sn.id = SHAREDINVALSNAPSHOT_ID; msg.sn.dbId = dbId; msg.sn.relId = relId; /* check AddCatcacheInvalidationMessage() for an explanation */ --- 419,430 ---- /* Don't add a duplicate item */ /* We assume dbId need not be checked because it will never change */ ProcessMessageList(hdr->rclist, ! if (msg->sn.id == SharedInvalSnapshot && msg->sn.relId == relId) return); /* OK, add the item */ ! msg.sn.id = SharedInvalSnapshot; msg.sn.dbId = dbId; msg.sn.relId = relId; /* check AddCatcacheInvalidationMessage() for an explanation */ *************** RegisterSnapshotInvalidation(Oid dbId, O *** 553,629 **** void LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg) { ! if (msg->id >= 0) { ! 
if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid) ! { ! InvalidateCatalogSnapshot(); ! SysCacheInvalidate(msg->cc.id, msg->cc.hashValue); ! CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue); ! } ! } ! else if (msg->id == SHAREDINVALCATALOG_ID) ! { ! if (msg->cat.dbId == MyDatabaseId || msg->cat.dbId == InvalidOid) ! { ! InvalidateCatalogSnapshot(); ! CatalogCacheFlushCatalog(msg->cat.catId); ! /* CatalogCacheFlushCatalog calls CallSyscacheCallbacks as needed */ ! } ! } ! else if (msg->id == SHAREDINVALRELCACHE_ID) ! { ! if (msg->rc.dbId == MyDatabaseId || msg->rc.dbId == InvalidOid) ! { ! int i; ! if (msg->rc.relId == InvalidOid) ! RelationCacheInvalidate(); ! else ! RelationCacheInvalidateEntry(msg->rc.relId); ! for (i = 0; i < relcache_callback_count; i++) ! { ! struct RELCACHECALLBACK *ccitem = relcache_callback_list + i; ! ccitem->function(ccitem->arg, msg->rc.relId); } ! } ! } ! else if (msg->id == SHAREDINVALSMGR_ID) ! { ! /* ! * We could have smgr entries for relations of other databases, so no ! * short-circuit test is possible here. ! */ ! RelFileNodeBackend rnode; ! rnode.node = msg->sm.rnode; ! rnode.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo; ! smgrclosenode(rnode); ! } ! else if (msg->id == SHAREDINVALRELMAP_ID) ! { ! /* We only care about our own database and shared catalogs */ ! if (msg->rm.dbId == InvalidOid) ! RelationMapInvalidate(true); ! else if (msg->rm.dbId == MyDatabaseId) ! RelationMapInvalidate(false); ! } ! else if (msg->id == SHAREDINVALSNAPSHOT_ID) ! { ! /* We only care about our own database and shared catalogs */ ! if (msg->rm.dbId == InvalidOid) ! InvalidateCatalogSnapshot(); ! else if (msg->rm.dbId == MyDatabaseId) ! InvalidateCatalogSnapshot(); } - else - elog(FATAL, "unrecognized SI message ID: %d", msg->id); } /* --- 554,633 ---- void LocalExecuteInvalidationMessage(SharedInvalidationMessage *msg) { ! switch ((SharedInvalMsgType) msg->id) { ! case SharedInvalCatcache: ! 
if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid) ! { ! InvalidateCatalogSnapshot(); ! SysCacheInvalidate(msg->cc.cacheId, msg->cc.hashValue); ! CallSyscacheCallbacks(msg->cc.cacheId, msg->cc.hashValue); ! } ! break; ! case SharedInvalCatalog: ! if (msg->cat.dbId == MyDatabaseId || msg->cat.dbId == InvalidOid) ! { ! InvalidateCatalogSnapshot(); ! CatalogCacheFlushCatalog(msg->cat.catId); ! /* ! * CatalogCacheFlushCatalog calls CallSyscacheCallbacks as ! * needed ! */ ! } ! break; ! case SharedInvalRelcache: ! if (msg->rc.dbId == MyDatabaseId || msg->rc.dbId == InvalidOid) ! { ! int i; ! if (msg->rc.relId == InvalidOid) ! RelationCacheInvalidate(); ! else ! RelationCacheInvalidateEntry(msg->rc.relId); ! for (i = 0; i < relcache_callback_count; i++) ! { ! struct RELCACHECALLBACK *ccitem = relcache_callback_list + i; ! ccitem->function(ccitem->arg, msg->rc.relId); ! } } ! break; ! case SharedInvalSmgr: ! { ! /* ! * We could have smgr entries for relations of other ! * databases, so no short-circuit test is possible here. ! */ ! RelFileNodeBackend rnode; ! rnode.node = msg->sm.rnode; ! rnode.backend = (msg->sm.backend_hi << 16) | (int) msg->sm.backend_lo; ! smgrclosenode(rnode); ! break; ! } ! case SharedInvalRelmap: ! /* We only care about our own database and shared catalogs */ ! if (msg->rm.dbId == InvalidOid) ! RelationMapInvalidate(true); ! else if (msg->rm.dbId == MyDatabaseId) ! RelationMapInvalidate(false); ! break; ! case SharedInvalSnapshot: ! /* We only care about our own database and shared catalogs */ ! if (msg->rm.dbId == InvalidOid) ! InvalidateCatalogSnapshot(); ! else if (msg->rm.dbId == MyDatabaseId) ! InvalidateCatalogSnapshot(); ! break; ! default: ! elog(FATAL, "unrecognized SI message ID: %d", msg->id); ! break; } } /* *************** CacheInvalidateSmgr(RelFileNodeBackend r *** 1351,1357 **** { SharedInvalidationMessage msg; ! 
msg.sm.id = SHAREDINVALSMGR_ID; msg.sm.backend_hi = rnode.backend >> 16; msg.sm.backend_lo = rnode.backend & 0xffff; msg.sm.rnode = rnode.node; --- 1355,1361 ---- { SharedInvalidationMessage msg; ! msg.sm.id = SharedInvalSmgr; msg.sm.backend_hi = rnode.backend >> 16; msg.sm.backend_lo = rnode.backend & 0xffff; msg.sm.rnode = rnode.node; *************** CacheInvalidateRelmap(Oid databaseId) *** 1381,1387 **** { SharedInvalidationMessage msg; ! msg.rm.id = SHAREDINVALRELMAP_ID; msg.rm.dbId = databaseId; /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); --- 1385,1391 ---- { SharedInvalidationMessage msg; ! msg.rm.id = SharedInvalRelmap; msg.rm.dbId = databaseId; /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h index 635acda..d0d9ece 100644 *** a/src/include/storage/sinval.h --- b/src/include/storage/sinval.h *************** *** 28,36 **** * * invalidate the mapped-relation mapping for a given database * * invalidate any saved snapshot that might be used to scan a given relation * More types could be added if needed. The message type is identified by ! * the first "int8" field of the message struct. Zero or positive means a ! * specific-catcache inval message (and also serves as the catcache ID field). ! * Negative values identify the other message types, as per codes below. * * Catcache inval events are initially driven by detecting tuple inserts, * updates and deletions in system catalogs (see CacheInvalidateHeapTuple). --- 28,34 ---- * * invalidate the mapped-relation mapping for a given database * * invalidate any saved snapshot that might be used to scan a given relation * More types could be added if needed. The message type is identified by ! * the first "int8" field of the message struct. 
* * Catcache inval events are initially driven by detecting tuple inserts, * updates and deletions in system catalogs (see CacheInvalidateHeapTuple). *************** *** 57,71 **** * sent immediately when the underlying file change is made. */ typedef struct { ! int8 id; /* cache ID --- must be first */ Oid dbId; /* database ID, or 0 if a shared relation */ uint32 hashValue; /* hash value of key for this catcache */ } SharedInvalCatcacheMsg; - #define SHAREDINVALCATALOG_ID (-1) - typedef struct { int8 id; /* type field --- must be first */ --- 55,78 ---- * sent immediately when the underlying file change is made. */ + typedef enum SharedInvalMsgType + { + SharedInvalCatcache, + SharedInvalCatalog, + SharedInvalRelcache, + SharedInvalSmgr, + SharedInvalRelmap, + SharedInvalSnapshot + } SharedInvalMsgType; + typedef struct { ! int8 id; /* type field --- must be first */ ! int8 cacheId; /* cache ID */ Oid dbId; /* database ID, or 0 if a shared relation */ uint32 hashValue; /* hash value of key for this catcache */ } SharedInvalCatcacheMsg; typedef struct { int8 id; /* type field --- must be first */ *************** typedef struct *** 73,80 **** Oid catId; /* ID of catalog whose contents are invalid */ } SharedInvalCatalogMsg; - #define SHAREDINVALRELCACHE_ID (-2) - typedef struct { int8 id; /* type field --- must be first */ --- 80,85 ---- *************** typedef struct *** 82,89 **** Oid relId; /* relation ID, or 0 if whole relcache */ } SharedInvalRelcacheMsg; - #define SHAREDINVALSMGR_ID (-3) - typedef struct { /* note: field layout chosen to pack into 16 bytes */ --- 87,92 ---- *************** typedef struct *** 93,108 **** RelFileNode rnode; /* spcNode, dbNode, relNode */ } SharedInvalSmgrMsg; - #define SHAREDINVALRELMAP_ID (-4) - typedef struct { int8 id; /* type field --- must be first */ Oid dbId; /* database ID, or 0 for shared catalogs */ } SharedInvalRelmapMsg; - #define SHAREDINVALSNAPSHOT_ID (-5) - typedef struct { int8 id; /* type field --- must be first 
*/ --- 96,107 ---- diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c index a7f367c..79823c2 100644 *** a/src/backend/access/rmgrdesc/standbydesc.c --- b/src/backend/access/rmgrdesc/standbydesc.c *************** standby_desc_invalidations(StringInfo bu *** 113,120 **** switch ((SharedInvalMsgType) msg->id) { ! case SharedInvalCatcache: ! appendStringInfo(buf, " catcache %d", msg->cc.cacheId); break; case SharedInvalCatalog: appendStringInfo(buf, " catalog %u", msg->cat.catId); --- 113,123 ---- switch ((SharedInvalMsgType) msg->id) { ! case SharedInvalCatcacheHash: ! appendStringInfo(buf, " catcache %d by hash", msg->cch.cacheId); ! break; ! case SharedInvalCatcacheOid: ! appendStringInfo(buf, " catcache %d by OID", msg->cco.cacheId); break; case SharedInvalCatalog: appendStringInfo(buf, " catalog %u", msg->cat.catId); diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 472285d..ebf4321 100644 *** a/src/backend/catalog/heap.c --- b/src/backend/catalog/heap.c *************** RemoveStatistics(Oid relid, AttrNumber a *** 3025,3030 **** --- 3025,3048 ---- systable_endscan(scan); + /* + * Aside from removing the catalog entries, issue sinval messages to + * remove any negative catcache entries for stats that weren't present. + * (Positive entries will get flushed as a consequence of deleting the + * catalog entries.) Without this, repeatedly creating and dropping temp + * tables tends to lead to catcache bloat, since any negative catcache + * entries created by planner lookups won't get dropped. + * + * We only bother with this for the whole-table case, since (a) it's less + * likely to be a problem for DROP COLUMN, and (b) the sinval + * infrastructure only supports matching an OID cache key column. + * (Alternatively, we could issue the sinval message always, accepting the + * collateral damage of losing negative catcache entries for other columns + * to be sure we get rid of entries for this one.) 
+ */ + if (attnum == 0) + CacheInvalidateCatcacheByOid(STATRELATTINH, false, 1, relid); + heap_close(pgstatistic, RowExclusiveLock); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 8152f7e..81c01f6 100644 *** a/src/backend/utils/cache/catcache.c --- b/src/backend/utils/cache/catcache.c *************** CatCacheRemoveCList(CatCache *cache, Cat *** 540,546 **** /* ! * CatCacheInvalidate * * Invalidate entries in the specified cache, given a hash value. * --- 540,546 ---- /* ! * CatCacheInvalidateByHash * * Invalidate entries in the specified cache, given a hash value. * *************** CatCacheRemoveCList(CatCache *cache, Cat *** 558,569 **** * This routine is only quasi-public: it should only be used by inval.c. */ void ! CatCacheInvalidate(CatCache *cache, uint32 hashValue) { Index hashIndex; dlist_mutable_iter iter; ! CACHE1_elog(DEBUG2, "CatCacheInvalidate: called"); /* * We don't bother to check whether the cache has finished initialization --- 558,569 ---- * This routine is only quasi-public: it should only be used by inval.c. */ void ! CatCacheInvalidateByHash(CatCache *cache, uint32 hashValue) { Index hashIndex; dlist_mutable_iter iter; ! CACHE1_elog(DEBUG2, "CatCacheInvalidateByHash: called"); /* * We don't bother to check whether the cache has finished initialization *************** CatCacheInvalidate(CatCache *cache, uint *** 603,609 **** } else CatCacheRemoveCTup(cache, ct); ! CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); #ifdef CATCACHE_STATS cache->cc_invals++; #endif --- 603,609 ---- } else CatCacheRemoveCTup(cache, ct); ! CACHE1_elog(DEBUG2, "CatCacheInvalidateByHash: invalidated"); #ifdef CATCACHE_STATS cache->cc_invals++; #endif *************** CatCacheInvalidate(CatCache *cache, uint *** 612,617 **** --- 612,683 ---- } } + /* + * CatCacheInvalidateByOid + * + * Invalidate negative entries in the specified cache, given a target OID. 
+ * + * We delete negative cache entries that have that OID value in column ckey. + * While we could also examine positive entries, there's no need to do so in + * current usage: any relevant positive entries should have been flushed by + * CatCacheInvalidateByHash calls due to deletions of those catalog entries. + * + * This routine is only quasi-public: it should only be used by inval.c. + */ + void + CatCacheInvalidateByOid(CatCache *cache, int ckey, Oid oid) + { + dlist_mutable_iter iter; + int i; + + CACHE1_elog(DEBUG2, "CatCacheInvalidateByOid: called"); + + /* If the cache hasn't finished initialization, there's nothing to do */ + if (cache->cc_tupdesc == NULL) + return; + + /* Assert that an OID column has been targeted */ + Assert(TupleDescAttr(cache->cc_tupdesc, + cache->cc_keyno[ckey - 1] - 1)->atttypid == OIDOID); + + /* + * There seems no need to flush CatCLists; removal of negative entries + * shouldn't affect the validity of searches. + */ + + /* + * Scan the whole cache for matches + */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_head *bucket = &cache->cc_bucket[i]; + + dlist_foreach_modify(iter, bucket) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* We only care about live negative entries */ + if (ct->dead || !ct->negative) + continue; + /* Negative entries won't be in clists */ + Assert(ct->c_list == NULL); + + if (oid == DatumGetObjectId(ct->keys[ckey - 1])) + { + if (ct->refcount > 0) + ct->dead = true; + else + CatCacheRemoveCTup(cache, ct); + CACHE1_elog(DEBUG2, "CatCacheInvalidateByOid: invalidated"); + #ifdef CATCACHE_STATS + cache->cc_invals++; + #endif + /* could be multiple matches, so keep looking! 
*/ + } + } + } + } + /* ---------------------------------------------------------------- * public functions * ---------------------------------------------------------------- *************** CatCacheCopyKeys(TupleDesc tupdesc, int *** 1995,2001 **** * the specified relation, find all catcaches it could be in, compute the * correct hash value for each such catcache, and call the specified * function to record the cache id and hash value in inval.c's lists. ! * SysCacheInvalidate will be called later, if appropriate, * using the recorded information. * * For an insert or delete, tuple is the target tuple and newtuple is NULL. --- 2061,2067 ---- * the specified relation, find all catcaches it could be in, compute the * correct hash value for each such catcache, and call the specified * function to record the cache id and hash value in inval.c's lists. ! * SysCacheInvalidateByHash will be called later, if appropriate, * using the recorded information. * * For an insert or delete, tuple is the target tuple and newtuple is NULL. diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c index 5bc08b0..168a97d 100644 *** a/src/backend/utils/cache/inval.c --- b/src/backend/utils/cache/inval.c *************** AppendInvalidationMessageList(Invalidati *** 331,349 **** */ /* ! * Add a catcache inval entry */ static void ! AddCatcacheInvalidationMessage(InvalidationListHeader *hdr, ! int id, uint32 hashValue, Oid dbId) { SharedInvalidationMessage msg; Assert(id < CHAR_MAX); ! msg.cc.id = SharedInvalCatcache; ! msg.cc.cacheId = (int8) id; ! msg.cc.dbId = dbId; ! msg.cc.hashValue = hashValue; /* * Define padding bytes in SharedInvalidationMessage structs to be --- 331,349 ---- */ /* ! * Add a catcache inval-by-hash entry */ static void ! AddCatcacheHashInvalidationMessage(InvalidationListHeader *hdr, ! int id, uint32 hashValue, Oid dbId) { SharedInvalidationMessage msg; Assert(id < CHAR_MAX); ! msg.cch.id = SharedInvalCatcacheHash; ! 
msg.cch.cacheId = (int8) id; ! msg.cch.dbId = dbId; ! msg.cch.hashValue = hashValue; /* * Define padding bytes in SharedInvalidationMessage structs to be *************** AddCatcacheInvalidationMessage(Invalidat *** 360,365 **** --- 360,386 ---- } /* + * Add a catcache inval-by-OID entry + */ + static void + AddCatcacheOidInvalidationMessage(InvalidationListHeader *hdr, + int id, int ckey, Oid oid, Oid dbId) + { + SharedInvalidationMessage msg; + + Assert(id < CHAR_MAX); + msg.cco.id = SharedInvalCatcacheOid; + msg.cco.cacheId = (int8) id; + msg.cco.ckey = (int8) ckey; + msg.cco.oid = oid; + msg.cco.dbId = dbId; + /* check AddCatcacheHashInvalidationMessage() for an explanation */ + VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); + + AddInvalidationMessage(&hdr->cclist, &msg); + } + + /* * Add a whole-catalog inval entry */ static void *************** AddCatalogInvalidationMessage(Invalidati *** 371,377 **** msg.cat.id = SharedInvalCatalog; msg.cat.dbId = dbId; msg.cat.catId = catId; ! /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->cclist, &msg); --- 392,398 ---- msg.cat.id = SharedInvalCatalog; msg.cat.dbId = dbId; msg.cat.catId = catId; ! /* check AddCatcacheHashInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->cclist, &msg); *************** AddRelcacheInvalidationMessage(Invalidat *** 401,407 **** msg.rc.id = SharedInvalRelcache; msg.rc.dbId = dbId; msg.rc.relId = relId; ! /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->rclist, &msg); --- 422,428 ---- msg.rc.id = SharedInvalRelcache; msg.rc.dbId = dbId; msg.rc.relId = relId; ! 
/* check AddCatcacheHashInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->rclist, &msg); *************** AddSnapshotInvalidationMessage(Invalidat *** 427,433 **** msg.sn.id = SharedInvalSnapshot; msg.sn.dbId = dbId; msg.sn.relId = relId; ! /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->rclist, &msg); --- 448,454 ---- msg.sn.id = SharedInvalSnapshot; msg.sn.dbId = dbId; msg.sn.relId = relId; ! /* check AddCatcacheHashInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); AddInvalidationMessage(&hdr->rclist, &msg); *************** ProcessInvalidationMessagesMulti(Invalid *** 477,493 **** */ /* ! * RegisterCatcacheInvalidation * ! * Register an invalidation event for a catcache tuple entry. */ static void ! RegisterCatcacheInvalidation(int cacheId, ! uint32 hashValue, ! Oid dbId) { ! AddCatcacheInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, ! cacheId, hashValue, dbId); } /* --- 498,529 ---- */ /* ! * RegisterCatcacheHashInvalidation * ! * Register an invalidation event for a catcache tuple entry identified ! * by hash value. */ static void ! RegisterCatcacheHashInvalidation(int cacheId, ! uint32 hashValue, ! Oid dbId) { ! AddCatcacheHashInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, ! cacheId, hashValue, dbId); ! } ! ! /* ! * RegisterCatcacheOidInvalidation ! * ! * Register an invalidation event for catcache tuple entries having ! * the specified OID in a particular cache key column. ! */ ! static void ! RegisterCatcacheOidInvalidation(int cacheId, ! int ckey, Oid oid, Oid dbId) ! { ! AddCatcacheOidInvalidationMessage(&transInvalInfo->CurrentCmdInvalidMsgs, ! cacheId, ckey, oid, dbId); } /* *************** LocalExecuteInvalidationMessage(SharedIn *** 556,569 **** { switch ((SharedInvalMsgType) msg->id) { ! case SharedInvalCatcache: ! 
if (msg->cc.dbId == MyDatabaseId || msg->cc.dbId == InvalidOid) { InvalidateCatalogSnapshot(); ! SysCacheInvalidate(msg->cc.cacheId, msg->cc.hashValue); ! CallSyscacheCallbacks(msg->cc.cacheId, msg->cc.hashValue); } break; case SharedInvalCatalog: --- 592,612 ---- { switch ((SharedInvalMsgType) msg->id) { ! case SharedInvalCatcacheHash: ! if (msg->cch.dbId == MyDatabaseId || msg->cch.dbId == InvalidOid) { InvalidateCatalogSnapshot(); ! SysCacheInvalidateByHash(msg->cch.cacheId, msg->cch.hashValue); ! CallSyscacheCallbacks(msg->cch.cacheId, msg->cch.hashValue); ! } ! break; ! case SharedInvalCatcacheOid: ! if (msg->cco.dbId == MyDatabaseId || msg->cco.dbId == InvalidOid) ! { ! SysCacheInvalidateByOid(msg->cco.cacheId, msg->cco.ckey, ! msg->cco.oid); } break; case SharedInvalCatalog: *************** CacheInvalidateHeapTuple(Relation relati *** 1157,1163 **** } else PrepareToInvalidateCacheTuple(relation, tuple, newtuple, ! RegisterCatcacheInvalidation); /* * Now, is this tuple one of the primary definers of a relcache entry? See --- 1200,1206 ---- } else PrepareToInvalidateCacheTuple(relation, tuple, newtuple, ! RegisterCatcacheHashInvalidation); /* * Now, is this tuple one of the primary definers of a relcache entry? See *************** CacheInvalidateHeapTuple(Relation relati *** 1217,1222 **** --- 1260,1286 ---- } /* + * CacheInvalidateCatcacheByOid + * Register invalidation of catcache entries referencing a given OID. + * + * This is used to kill negative catcache entries that are believed to be + * no longer useful. The entries are identified by which cache they are + * in, the cache key column to look at, and the target OID. + * + * Note: we expect caller to know whether the specified cache is on a + * shared or local system catalog. We could ask syscache.c for that info, + * but it seems probably not worth the trouble, since this is likely to + * have few callers. 
+ */ + void + CacheInvalidateCatcacheByOid(int cacheId, bool isshared, int ckey, Oid oid) + { + Oid dbId = isshared ? (Oid) 0 : MyDatabaseId; + + RegisterCatcacheOidInvalidation(cacheId, ckey, oid, dbId); + } + + /* * CacheInvalidateCatalog * Register invalidation of the whole content of a system catalog. * *************** CacheInvalidateSmgr(RelFileNodeBackend r *** 1359,1365 **** msg.sm.backend_hi = rnode.backend >> 16; msg.sm.backend_lo = rnode.backend & 0xffff; msg.sm.rnode = rnode.node; ! /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); SendSharedInvalidMessages(&msg, 1); --- 1423,1429 ---- msg.sm.backend_hi = rnode.backend >> 16; msg.sm.backend_lo = rnode.backend & 0xffff; msg.sm.rnode = rnode.node; ! /* check AddCatcacheHashInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); SendSharedInvalidMessages(&msg, 1); *************** CacheInvalidateRelmap(Oid databaseId) *** 1387,1393 **** msg.rm.id = SharedInvalRelmap; msg.rm.dbId = databaseId; ! /* check AddCatcacheInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); SendSharedInvalidMessages(&msg, 1); --- 1451,1457 ---- msg.rm.id = SharedInvalRelmap; msg.rm.dbId = databaseId; ! /* check AddCatcacheHashInvalidationMessage() for an explanation */ VALGRIND_MAKE_MEM_DEFINED(&msg, sizeof(msg)); SendSharedInvalidMessages(&msg, 1); diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19..3e5acd5 100644 *** a/src/backend/utils/cache/syscache.c --- b/src/backend/utils/cache/syscache.c *************** SearchSysCacheList(int cacheId, int nkey *** 1434,1448 **** } /* ! * SysCacheInvalidate * * Invalidate entries in the specified cache, given a hash value. ! * See CatCacheInvalidate() for more info. * * This routine is only quasi-public: it should only be used by inval.c. */ void ! 
SysCacheInvalidate(int cacheId, uint32 hashValue) { if (cacheId < 0 || cacheId >= SysCacheSize) elog(ERROR, "invalid cache ID: %d", cacheId); --- 1434,1448 ---- } /* ! * SysCacheInvalidateByHash * * Invalidate entries in the specified cache, given a hash value. ! * See CatCacheInvalidateByHash() for more info. * * This routine is only quasi-public: it should only be used by inval.c. */ void ! SysCacheInvalidateByHash(int cacheId, uint32 hashValue) { if (cacheId < 0 || cacheId >= SysCacheSize) elog(ERROR, "invalid cache ID: %d", cacheId); *************** SysCacheInvalidate(int cacheId, uint32 h *** 1451,1457 **** if (!PointerIsValid(SysCache[cacheId])) return; ! CatCacheInvalidate(SysCache[cacheId], hashValue); } /* --- 1451,1478 ---- if (!PointerIsValid(SysCache[cacheId])) return; ! CatCacheInvalidateByHash(SysCache[cacheId], hashValue); ! } ! ! /* ! * SysCacheInvalidateByOid ! * ! * Invalidate negative entries in the specified cache, given a target OID. ! * See CatCacheInvalidateByOid() for more info. ! * ! * This routine is only quasi-public: it should only be used by inval.c. ! */ ! void ! SysCacheInvalidateByOid(int cacheId, int ckey, Oid oid) ! { ! if (cacheId < 0 || cacheId >= SysCacheSize) ! elog(ERROR, "invalid cache ID: %d", cacheId); ! ! /* if this cache isn't initialized yet, no need to do anything */ ! if (!PointerIsValid(SysCache[cacheId])) ! return; ! ! CatCacheInvalidateByOid(SysCache[cacheId], ckey, oid); } /* diff --git a/src/include/storage/sinval.h b/src/include/storage/sinval.h index d0d9ece..004cb45 100644 *** a/src/include/storage/sinval.h --- b/src/include/storage/sinval.h *************** *** 20,26 **** /* * We support several types of shared-invalidation messages: ! 
  *	* invalidate a specific tuple in a specific catcache
  *	* invalidate all catcache entries from a given system catalog
  *	* invalidate a relcache entry for a specific logical relation
  *	* invalidate all relcache entries
--- 20,27 ----
  /*
   * We support several types of shared-invalidation messages:
! *	* invalidate a specific tuple (identified by hash) in a specific catcache
! *	* invalidate negative entries matching a given OID in a specific catcache
  *	* invalidate all catcache entries from a given system catalog
  *	* invalidate a relcache entry for a specific logical relation
  *	* invalidate all relcache entries
***************
*** 30,36 ****
  * More types could be added if needed.  The message type is identified by
  * the first "int8" field of the message struct.
  *
! * Catcache inval events are initially driven by detecting tuple inserts,
  * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
  * An update can generate two inval events, one for the old tuple and one for
  * the new, but this is reduced to one event if the tuple's hash key doesn't
--- 31,37 ----
  * More types could be added if needed.  The message type is identified by
  * the first "int8" field of the message struct.
  *
! * Catcache hash inval events are initially driven by detecting tuple inserts,
  * updates and deletions in system catalogs (see CacheInvalidateHeapTuple).
  * An update can generate two inval events, one for the old tuple and one for
  * the new, but this is reduced to one event if the tuple's hash key doesn't
***************
*** 57,63 ****
  typedef enum SharedInvalMsgType
  {
! 	SharedInvalCatcache,
  	SharedInvalCatalog,
  	SharedInvalRelcache,
  	SharedInvalSmgr,
--- 58,65 ----
  typedef enum SharedInvalMsgType
  {
! 	SharedInvalCatcacheHash,
! 	SharedInvalCatcacheOid,
  	SharedInvalCatalog,
  	SharedInvalRelcache,
  	SharedInvalSmgr,
*************** typedef struct
*** 71,77 ****
  	int8		cacheId;		/* cache ID */
  	Oid			dbId;			/* database ID, or 0 if a shared relation */
  	uint32		hashValue;		/* hash value of key for this catcache */
! } SharedInvalCatcacheMsg;
  
  typedef struct
  {
--- 73,88 ----
  	int8		cacheId;		/* cache ID */
  	Oid			dbId;			/* database ID, or 0 if a shared relation */
  	uint32		hashValue;		/* hash value of key for this catcache */
! } SharedInvalCatcacheHashMsg;
! 
! typedef struct
! {
! 	int8		id;				/* type field --- must be first */
! 	int8		cacheId;		/* cache ID */
! 	int8		ckey;			/* cache key column (1..CATCACHE_MAXKEYS) */
! 	Oid			oid;			/* OID of cache entries to remove */
! 	Oid			dbId;			/* database ID, or 0 if a shared relation */
! } SharedInvalCatcacheOidMsg;
  
  typedef struct
  {
*************** typedef struct
*** 112,118 ****
  typedef union
  {
  	int8		id;				/* type field --- must be first */
! 	SharedInvalCatcacheMsg cc;
  	SharedInvalCatalogMsg cat;
  	SharedInvalRelcacheMsg rc;
  	SharedInvalSmgrMsg sm;
--- 123,130 ----
  typedef union
  {
  	int8		id;				/* type field --- must be first */
! 	SharedInvalCatcacheHashMsg cch;
! 	SharedInvalCatcacheOidMsg cco;
  	SharedInvalCatalogMsg cat;
  	SharedInvalRelcacheMsg rc;
  	SharedInvalSmgrMsg sm;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 65d816a..47b72d6 100644
*** a/src/include/utils/catcache.h
--- b/src/include/utils/catcache.h
*************** extern void ReleaseCatCacheList(CatCList
*** 219,225 ****
  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);
! extern void CatCacheInvalidate(CatCache *cache, uint32 hashValue);
  
  extern void PrepareToInvalidateCacheTuple(Relation relation,
  							  HeapTuple tuple,
  							  HeapTuple newtuple,
--- 219,226 ----
  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);
! extern void CatCacheInvalidateByHash(CatCache *cache, uint32 hashValue);
! extern void CatCacheInvalidateByOid(CatCache *cache, int ckey, Oid oid);
  
  extern void PrepareToInvalidateCacheTuple(Relation relation,
  							  HeapTuple tuple,
  							  HeapTuple newtuple,
diff --git a/src/include/utils/inval.h b/src/include/utils/inval.h
index c557640..d1181bc 100644
*** a/src/include/utils/inval.h
--- b/src/include/utils/inval.h
*************** extern void CacheInvalidateHeapTuple(Rel
*** 39,44 ****
--- 39,47 ----
  							 HeapTuple tuple,
  							 HeapTuple newtuple);
  
+ extern void CacheInvalidateCatcacheByOid(int cacheId, bool isshared,
+ 							 int ckey, Oid oid);
+ 
  extern void CacheInvalidateCatalog(Oid catalogId);
  
  extern void CacheInvalidateRelcache(Relation relation);
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 95ee489..983fd00 100644
*** a/src/include/utils/syscache.h
--- b/src/include/utils/syscache.h
*************** struct catclist;
*** 159,165 ****
  extern struct catclist *SearchSysCacheList(int cacheId, int nkeys,
  				   Datum key1, Datum key2, Datum key3);
  
! extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
  
  extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
  extern bool RelationHasSysCache(Oid relid);
--- 159,166 ----
  extern struct catclist *SearchSysCacheList(int cacheId, int nkeys,
  				   Datum key1, Datum key2, Datum key3);
  
! extern void SysCacheInvalidateByHash(int cacheId, uint32 hashValue);
! extern void SysCacheInvalidateByOid(int cacheId, int ckey, Oid oid);
  
  extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
  extern bool RelationHasSysCache(Oid relid);
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> I'm really disappointed by the direction this thread is going in.
> The latest patches add an enormous amount of mechanism, and user-visible complexity, to do something that we learned was a bad idea decades ago.
> Putting a limit on the size of the syscaches doesn't accomplish anything except to add cycles if your cache working set is below the limit, or make performance fall off a cliff if it's above the limit.  I don't think there's any reason to believe that making it more complicated will avoid that problem.
>
> What does seem promising is something similar to Horiguchi-san's original patches all the way back at
> https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
> so I'd been thinking about ways to fix that case in particular.

You're suggesting going back to the original issue (bloat by negative cache entries) and giving it a simpler solution first, aren't you?  That may be the way to go.  But the syscache/relcache bloat still remains a problem when there are many live tables and application connections.  Would you agree to solve this in some way?  I thought Horiguchi-san's latest patches would solve this as well as the negative entries.  Can we consider that his patch and yours are orthogonal, i.e., that we can pursue Horiguchi-san's patch after yours is committed?

(As you said, some parts of Horiguchi-san's patches may be made simpler.  For example, the ability to change another session's GUC variable can be discussed in a separate thread.)

I think we need some limit to the size of the relcache, syscache, and plancache.  Oracle and MySQL both have it, using LRU to evict less frequently used entries.  You seem to be concerned about the LRU management based on your experience, but would it really cost so much, as long as each postgres process can change the LRU list without coordination with other backends?  Could you share your experience?
FYI, Oracle provides one parameter, shared_pool_size, that determines the size of a memory area containing SQL plans and various dictionary objects.  Oracle decides how to divide the area among constituents.  So it could be possible that one component (e.g. table/index metadata) is short of space, while another (e.g. SQL plans) has free space.  Oracle provides a system view to see the free space and hit/miss counts of each component.  If one component suffers from memory shortage, the user increases shared_pool_size.  This is similar to what Horiguchi-san is proposing.

MySQL enables fine-tuning of each component.  It provides size parameters for six memory partitions of the dictionary object cache, and the usage statistics of those partitions through the Performance Schema:

  tablespace definition cache
  schema definition cache
  table definition cache
  stored program definition cache
  character set definition cache
  collation definition cache

I wonder whether we can group existing relcache/syscache entries like this.

[MySQL] 14.4 Dictionary Object Cache
https://dev.mysql.com/doc/refman/8.0/en/data-dictionary-object-cache.html
--------------------------------------------------
The dictionary object cache is a shared global cache that stores previously accessed data dictionary objects in memory to enable object reuse and minimize disk I/O.

Similar to other cache mechanisms used by MySQL, the dictionary object cache uses an LRU-based eviction strategy to evict least recently used objects from memory.

The dictionary object cache comprises cache partitions that store different object types.  Some cache partition size limits are configurable, whereas others are hardcoded.
--------------------------------------------------

8.12.3.1 How MySQL Uses Memory
https://dev.mysql.com/doc/refman/8.0/en/memory-use.html
--------------------------------------------------
table_open_cache

MySQL requires memory and descriptors for the table cache.
table_definition_cache

For InnoDB, table_definition_cache acts as a soft limit for the number of open table instances in the InnoDB data dictionary cache.  If the number of open table instances exceeds the table_definition_cache setting, the LRU mechanism begins to mark table instances for eviction and eventually removes them from the data dictionary cache.  The limit helps address situations in which significant amounts of memory would be used to cache rarely used table instances until the next server restart.
--------------------------------------------------

Regards
Takayuki Tsunakawa
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:
> But the syscache/relcache bloat still remains a problem, when there are many live tables and application connections.  Would you agree to solve this in some way?  I thought Horiguchi-san's latest patches would solve this and the negative entries.  Can we consider that his patch and yours are orthogonal, i.e., we can pursue Horiguchi-san's patch after yours is committed?

Certainly, what I've done here doesn't preclude adding some wider solution to the issue of extremely large catcaches.  I think it takes the pressure off for one rather narrow problem case, and the mechanism could be used to fix other ones.  But if you've got an application that just plain accesses a huge number of objects, this isn't going to make your life better.

> (As you said, some parts of Horiguchi-san's patches may be made simpler.  For example, the ability to change another session's GUC variable can be discussed in a separate thread.)

Yeah, that idea seems just bad from here ...

> I think we need some limit to the size of the relcache, syscache, and plancache.  Oracle and MySQL both have it, using LRU to evict less frequently used entries.  You seem to be concerned about the LRU management based on your experience, but would it really cost so much as long as each postgres process can change the LRU list without coordination with other backends now?  Could you share your experience?

Well, we *had* an LRU mechanism for the catcaches way back when.  We got rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU info was expensive and (b) performance fell off a cliff in scenarios where the cache size limit was exceeded.  You could probably find some more info about that by scanning the mail list archives from around the time of that commit, but I'm too lazy to do so right now.

That was a dozen years ago, and it's possible that machine performance has moved so much since then that the problems are gone or mitigated.
In particular I'm sure that any limit we would want to impose today will be far more than the 5000-entries-across-all-caches limit that was in use back then.  But I'm not convinced that a workload that would create 100K cache entries in the first place wouldn't have severe problems if you tried to constrain it to use only 80K entries.  I fear it's just wishful thinking to imagine that the behavior of a larger cache won't be just like a smaller one.  Also, IIRC some of the problem with the LRU code was that it resulted in lots of touches of unrelated data, leading to CPU cache miss problems.  It's hard to see how that doesn't get even worse with a bigger cache.

As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production.

The plancache has never had a limit either, which is a design choice that was strongly influenced by our experience with catcaches.  Again, I'm concerned about the costs of adding a management layer, and the likelihood that cache flushes will simply remove entries we'll soon have to rebuild.

> FYI, Oracle provides one parameter, shared_pool_size, that determines the size of a memory area that contains SQL plans and various dictionary objects.  Oracle decides how to divide the area among constituents.  So it could be possible that one component (e.g. table/index metadata) is short of space, and another (e.g. SQL plans) has free space.  Oracle provides a system view to see the free space and hit/miss of each component.  If one component suffers from memory shortage, the user increases shared_pool_size.  This is similar to what Horiguchi-san is proposing.

Oracle seldom impresses me as having designs we ought to follow.  They have a well-earned reputation for requiring a lot of expertise to operate, which is not the direction this project should be going in.
In particular, I don't want to "solve" cache size issues by exposing a bunch of knobs that most users won't know how to twiddle. regards, tom lane
Hi,

On 2019-01-15 13:32:36 -0500, Tom Lane wrote:
> Well, we *had* an LRU mechanism for the catcaches way back when.  We got rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU info was expensive and (b) performance fell off a cliff in scenarios where the cache size limit was exceeded.  You could probably find some more info about that by scanning the mail list archives from around the time of that commit, but I'm too lazy to do so right now.
>
> That was a dozen years ago, and it's possible that machine performance has moved so much since then that the problems are gone or mitigated.  In particular I'm sure that any limit we would want to impose today will be far more than the 5000-entries-across-all-caches limit that was in use back then.  But I'm not convinced that a workload that would create 100K cache entries in the first place wouldn't have severe problems if you tried to constrain it to use only 80K entries.

I think that'd be true if the accesses were truly randomly distributed - but that's not the case in the cases where I've seen huge caches.  It's usually workloads that have tons of functions, partitions, ... and a lot of them are not that frequently accessed, but because we have no cache purging mechanism they stay around for a long time.  This is often exacerbated by using a pooler to keep connections around for longer (which you have to, to cope with other limits of PG).

> As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production.

It definitely is.

> The plancache has never had a limit either, which is a design choice that was strongly influenced by our experience with catcaches.

This sounds a lot like having learned lessons from one bad implementation and applying them far outside of that situation.

Greetings,

Andres Freund
On Tue, Jan 15, 2019 at 01:32:36PM -0500, Tom Lane wrote:
> ...
> > FYI, Oracle provides one parameter, shared_pool_size, that determines the size of a memory area that contains SQL plans and various dictionary objects.  Oracle decides how to divide the area among constituents.  So it could be possible that one component (e.g. table/index metadata) is short of space, and another (e.g. SQL plans) has free space.  Oracle provides a system view to see the free space and hit/miss of each component.  If one component suffers from memory shortage, the user increases shared_pool_size.  This is similar to what Horiguchi-san is proposing.
>
> Oracle seldom impresses me as having designs we ought to follow.  They have a well-earned reputation for requiring a lot of expertise to operate, which is not the direction this project should be going in.  In particular, I don't want to "solve" cache size issues by exposing a bunch of knobs that most users won't know how to twiddle.
>
> 			regards, tom lane

+1

Regards,
Ken
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Certainly, what I've done here doesn't preclude adding some wider solution to the issue of extremely large catcaches.

I'm relieved to hear that.

> I think it takes the pressure off for one rather narrow problem case, and the mechanism could be used to fix other ones.  But if you've got an application that just plain accesses a huge number of objects, this isn't going to make your life better.

I understand you're trying to solve the problem caused by negative cache entries as soon as possible, because the user is really suffering from it.  I feel sympathy with that attitude, because you seem to be always addressing issues that others are reluctant to take.  That's one of the reasons I respect you.

> Well, we *had* an LRU mechanism for the catcaches way back when.  We got rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU info was expensive and (b) performance fell off a cliff in scenarios where the cache size limit was exceeded.  You could probably find some more info about that by scanning the mail list archives from around the time of that commit, but I'm too lazy to do so right now.

Oh, in 2006...  I'll examine the patch and the discussion to see how the LRU management was done.

> That was a dozen years ago, and it's possible that machine performance has moved so much since then that the problems are gone or mitigated.

I really, really hope so.  Even if we see some visible impact from the LRU management, I think that's the debt PostgreSQL had to pay for but hasn't yet.  Even the single-process MySQL, which doesn't suffer from cache bloat across many server processes, has the ability to limit the cache.  And PostgreSQL has many parameters for various memory components such as shared_buffers, wal_buffers, work_mem, etc., so it would be reasonable to also have a limit for the catalog caches.
That said, we can avoid the penalty and retain the current performance by disabling the limit (some_size_param = 0).

I think we'll evaluate the impact of LRU management by adding prev and next members to the catcache and relcache structures, and putting the entry at the front (or back) of the LRU chain every time the entry is obtained.  I think pgbench's select-only mode is enough for evaluation.  I'd like to hear if any other workload is more appropriate to see the CPU cache effect.

> In particular I'm sure that any limit we would want to impose today will be far more than the 5000-entries-across-all-caches limit that was in use back then.  But I'm not convinced that a workload that would create 100K cache entries in the first place wouldn't have severe problems if you tried to constrain it to use only 80K entries.  I fear it's just wishful thinking to imagine that the behavior of a larger cache won't be just like a smaller one.  Also, IIRC some of the problem with the LRU code was that it resulted in lots of touches of unrelated data, leading to CPU cache miss problems.  It's hard to see how that doesn't get even worse with a bigger cache.
>
> As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production.

As Andres and Robert mentioned, we want to free less frequently used cache entries.  Otherwise, we keep suffering from bloat of up to TBs of memory.  This is a real, not hypothetical issue...

> The plancache has never had a limit either, which is a design choice that was strongly influenced by our experience with catcaches.  Again, I'm concerned about the costs of adding a management layer, and the likelihood that cache flushes will simply remove entries we'll soon have to rebuild.

Fortunately, we're not bothered with the plan cache.
But I remember you said you were annoyed by PL/pgSQL's plan cache use at Salesforce.  Were you able to overcome it somehow?

> Oracle seldom impresses me as having designs we ought to follow.  They have a well-earned reputation for requiring a lot of expertise to operate, which is not the direction this project should be going in.  In particular, I don't want to "solve" cache size issues by exposing a bunch of knobs that most users won't know how to twiddle.

Oracle certainly seems to be difficult to use.  But they seem to be studying other DBMSs to make it simpler to use.  I'm sure they also have a lot we should learn from, and the cache limit is one of them (although MySQL's per-cache tuning may be better.)

And having limits for various components would be the first step toward the autonomous database; tunable -> auto tuning -> autonomous.

Regards
Takayuki Tsunakawa
On Sun, Jan 13, 2019 at 11:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Putting a limit on the size of the syscaches doesn't accomplish anything except to add cycles if your cache working set is below the limit, or make performance fall off a cliff if it's above the limit.

If you're running on a Turing machine, sure.  But real machines have finite memory, or at least all the ones I use do.  Horiguchi-san is right that this is a real, not theoretical problem.  It is one of the most frequent operational concerns that EnterpriseDB customers have.

I'm not against solving specific cases with more targeted fixes, but I really believe we need something more.  Andres mentioned one problem case: connection poolers that eventually end up with a cache entry for every object in the system.  Another case is that of people who keep idle connections open for long periods of time; those connections can gobble up large amounts of memory even though they're not going to use any of their cache entries any time soon.

The flaw in your thinking, as it seems to me, is that in your concern for "the likelihood that cache flushes will simply remove entries we'll soon have to rebuild," you're apparently unwilling to consider the possibility of workloads where cache flushes will remove entries we *won't* soon have to rebuild.  Every time that issue gets raised, you seem to blow it off as if it were not a thing that really happens.  I can't make sense of that position.  Is it really so hard to imagine a connection pooler that switches the same connection back and forth between two applications with different working sets?  Or a system that keeps persistent connections open even when they are idle?  Do you really believe that a connection that has not accessed a cache entry in 10 minutes still derives more benefit from that cache entry than it would from freeing up some memory?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
> The flaw in your thinking, as it seems to me, is that in your concern for "the likelihood that cache flushes will simply remove entries we'll soon have to rebuild," you're apparently unwilling to consider the possibility of workloads where cache flushes will remove entries we *won't* soon have to rebuild.  Every time that issue gets raised, you seem to blow it off as if it were not a thing that really happens.  I can't make sense of that position.  Is it really so hard to imagine a connection pooler that switches the same connection back and forth between two applications with different working sets?  Or a system that keeps persistent connections open even when they are idle?  Do you really believe that a connection that has not accessed a cache entry in 10 minutes still derives more benefit from that cache entry than it would from freeing up some memory?

Well, I think everyone agrees there are workloads that cause undesired cache bloat.  What we have not found is a solution that doesn't cause code complexity or undesired overhead, or one that >1% of users will know how to use.

Unfortunately, because we have not found something we are happy with, we have done nothing.  I agree LRU can be expensive.  What if we do some kind of clock sweep and expiration like we do for shared buffers?  I think the trick is figuring out how frequently to do the sweep.  What if we mark entries as unused every 10 queries, mark them as used on first use, and delete cache entries that have not been used in the past 10 queries?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On 18/01/2019 08:48, Bruce Momjian wrote:
> On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
>> The flaw in your thinking, as it seems to me, is that in your concern for "the likelihood that cache flushes will simply remove entries we'll soon have to rebuild," you're apparently unwilling to consider the possibility of workloads where cache flushes will remove entries we *won't* soon have to rebuild.  Every time that issue gets raised, you seem to blow it off as if it were not a thing that really happens.  I can't make sense of that position.  Is it really so hard to imagine a connection pooler that switches the same connection back and forth between two applications with different working sets?  Or a system that keeps persistent connections open even when they are idle?  Do you really believe that a connection that has not accessed a cache entry in 10 minutes still derives more benefit from that cache entry than it would from freeing up some memory?
>
> Well, I think everyone agrees there are workloads that cause undesired cache bloat.  What we have not found is a solution that doesn't cause code complexity or undesired overhead, or one that >1% of users will know how to use.
>
> Unfortunately, because we have not found something we are happy with, we have done nothing.  I agree LRU can be expensive.  What if we do some kind of clock sweep and expiration like we do for shared buffers?  I think the trick is figuring out how frequently to do the sweep.  What if we mark entries as unused every 10 queries, mark them as used on first use, and delete cache entries that have not been used in the past 10 queries?

If you take that approach, then this number should be configurable.  What if I had 12 common queries I used in rotation?
The ARM3 processor cache logic was to simply eject an entry at random, as Acorn obviously felt that the silicon required for a more sophisticated algorithm would reduce the cache size too much!

I upgraded my Acorn Archimedes, which had an 8MHz bus, from an 8MHz ARM2 to a 25MHz ARM3.  That is a clock rate improvement of about 3 times.  However, BASIC programs ran about 7 times faster, which I put down to the ARM3 having a cache.

Obviously for Postgres this is not directly relevant, but I think it suggests that it may be worth considering replacing cache items at random.  There are no pathological corner cases, and the logic is very simple.

Cheers,
Gavin
Hello.

At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in <4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
> On 18/01/2019 08:48, Bruce Momjian wrote:
> > On Thu, Jan 17, 2019 at 11:33:35AM -0500, Robert Haas wrote:
> >> The flaw in your thinking, as it seems to me, is that in your concern for "the likelihood that cache flushes will simply remove entries we'll soon have to rebuild," you're apparently unwilling to consider the possibility of workloads where cache flushes will remove entries we *won't* soon have to rebuild.  Every time that issue gets raised, you seem to blow it off as if it were not a thing that really happens.  I can't make sense of that position.  Is it really so hard to imagine a connection pooler that switches the same connection back and forth between two applications with different working sets?  Or a system that keeps persistent connections open even when they are idle?  Do you really believe that a connection that has not accessed a cache entry in 10 minutes still derives more benefit from that cache entry than it would from freeing up some memory?
> > Well, I think everyone agrees there are workloads that cause undesired cache bloat.  What we have not found is a solution that doesn't cause code complexity or undesired overhead, or one that >1% of users will know how to use.
> >
> > Unfortunately, because we have not found something we are happy with, we have done nothing.  I agree LRU can be expensive.  What if we do some kind of clock sweep and expiration like we do for shared buffers?  I

So, it doesn't use LRU but a kind of clock-sweep method.  When resiz(doubl)ing the hash, which happens when the current hash is filled up, would push the cache size past the threshold, it first tries to trim away the entries that have been left unused for a duration corresponding to their usage count.  This is not a hard limit, but it seems to be a good compromise.
> > think the trick is figuring out how frequently to do the sweep.  What if we mark entries as unused every 10 queries, mark them as used on first use, and delete cache entries that have not been used in the past 10 queries.

As above, it tries pruning at every resizing time.  So this adds complexity to the frequent paths only by setting the last-accessed time and incrementing an access counter.  It scans the whole hash at resize time, but that doesn't add much compared to the resizing itself.

> If you take that approach, then this number should be configurable.  What if I had 12 common queries I used in rotation?

This basically has two knobs: the minimum hash size at which pruning is done, and the idle time before unused entries are reaped, per catcache.

> The ARM3 processor cache logic was to simply eject an entry at random, as Acorn obviously felt that the silicon required for a more sophisticated algorithm would reduce the cache size too much!
>
> I upgraded my Acorn Archimedes, which had an 8MHz bus, from an 8MHz ARM2 to a 25MHz ARM3.  That is a clock rate improvement of about 3 times.  However, BASIC programs ran about 7 times faster, which I put down to the ARM3 having a cache.
>
> Obviously for Postgres this is not directly relevant, but I think it suggests that it may be worth considering replacing cache items at random.  There are no pathological corner cases, and the logic is very simple.

Memory was more expensive than nowadays by about 10^3 times?  An obvious advantage of random reaping is that it requires less silicon.  I think we don't need to be so stingy, but perhaps clock-sweep is the maximum we can pay.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 18 Jan 2019 16:39:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190118.163929.229869562.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
>
> At Fri, 18 Jan 2019 11:46:03 +1300, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote in <4e62e6b7-0ffb-54ae-3757-5583fcca38c0@archidevsys.co.nz>
> > On 18/01/2019 08:48, Bruce Momjian wrote:
> > > Unfortunately, because we have not found something we are happy with, we have done nothing.  I agree LRU can be expensive.  What if we do some kind of clock sweep and expiration like we do for shared buffers?  I
>
> So, it doesn't use LRU but a kind of clock-sweep method.  When resiz(doubl)ing the hash, which happens when the current hash is filled up, would push the cache size past the threshold, it first tries to trim away the entries that have been left unused for a duration corresponding to their usage count.  This is not a hard limit, but it seems to be a good compromise.
>
> > > think the trick is figuring out how frequently to do the sweep.  What if we mark entries as unused every 10 queries, mark them as used on first use, and delete cache entries that have not been used in the past 10 queries.
>
> As above, it tries pruning at every resizing time.  So this adds complexity to the frequent paths only by setting the last-accessed time and incrementing an access counter.  It scans the whole hash at resize time, but that doesn't add much compared to the resizing itself.
>
> > If you take that approach, then this number should be configurable.  What if I had 12 common queries I used in rotation?
>
> This basically has two knobs: the minimum hash size at which pruning is done, and the idle time before unused entries are reaped, per catcache.

This is the rebased version.

0001: catcache pruning

syscache_memory_target controls, on a per-cache basis, the minimum size at which pruning starts.  syscache_prune_min_time controls the minimum idle duration until a catcache entry is removed.
0002: catcache statistics view track_syscache_usage_interval is the interval statitics of catcache is collected. pg_stat_syscache is the view that shows the statistics. 0003: Remote GUC setting It is independent from the above two, and heavily arguable. pg_set_backend_config(pid, name, value) changes the GUC <name> on the backend with <pid> to <value>. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 7071de30e79507f55d8021dc9c8b6801a292745c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 166 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 254 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b6f5822b84..af3c52b868 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. 
After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + certain amount of syscache entries with intermittent usage, try + increase this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 18467d96d2..dbffec8067 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,7 +733,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 8152f7e21e..ee40093553 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -72,9 +72,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum hash size at which to consider entry + * eviction. This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age, in seconds, of entries that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -491,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -842,6 +858,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -859,9 +876,129 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if it hasn't been done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent the catcache from + * bloating. Eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element of nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes values from 0 up to 2.
+ */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that have not been accessed since the last pruning are + * removed after that many seconds, and entries that have been + * accessed several times are removed after being left alone for + * up to three times that duration. We don't try to shrink the + * buckets, since pruning effectively caps catcache expansion in + * the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1275,6 +1412,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1820,11 +1962,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1843,13 +1987,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1877,8 +2022,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1899,17 +2044,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry.
If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c216ed0922..134c357bf3 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -80,6 +80,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2190,6 +2191,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index a21865a77f..d82af3bd6c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..5d24809900 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total size of catcache tuples in bytes */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 7cc50a1bf62290c704d90cd9b5b740d68cd8f646 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/3] Syscache usage tracking feature. Collects syscache usage statistics and shows them using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 206 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 ++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 7 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 582 insertions(+), 45 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index af3c52b868..6dd024340b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6662,6 +6662,21 @@ COPY
postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which system cache usage + statistics are collected. This parameter is 0 by default, which + disables collection. Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index f4d9e9daf7..30e2da935a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -904,6 +904,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1183,6 +1199,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT
pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 13da412c59..2c0c6b343e 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #include "utils/tqual.h" @@ -125,6 +126,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -631,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files to remove. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -645,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -684,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -4286,6 +4311,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6376,3 +6404,163 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. 
+ */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval. + */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + return 0; + + + /* Check against the reporting interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not yet time; tell the caller the remaining time */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now write the file */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Inhibit + * interrupts to avoid recursive entry.
+ */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell caller to wait for the next interval. + */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out stats for every catcache */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} + +/* + * GUC assignment callback for track_syscache_usage_interval. + * + * Create a statistics file immediately when syscache statistics collection is + * turned on, and remove it as soon as collection is turned off. + */ +void +pgstat_track_syscache_assign_hook(int newval, void *extra) +{ + if (newval > 0) + { + /* + * Immediately create a stats file. It's safe since we're not in the + * midst of accessing the syscache.
+ */ + pgstat_write_syscache_stats(true); + } + else + { + /* Turned off, immediately remove the statsfile */ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care of the result */ + } +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 0c0891b33e..e7972e645f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. */ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 053bb73863..0d32bf8daa 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + 
MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; + + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) 
stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index ee40093553..4a3b3094a0 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -90,6 +90,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. */ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -620,9 +624,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -698,9 +700,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -907,10 +907,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. 
*/ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -924,7 +925,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is quite a + * time-consuming task during a catcache lookup, but acceptable since we + * are about to expand the hash table anyway. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -937,21 +942,21 @@ CatCacheCleanupOldEntries(CatCache *cp) /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -984,14 +989,17 @@ CatCacheCleanupOldEntries(CatCache *cp) } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age class must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1368,9 +1376,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1430,9 +1436,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1441,9 +1445,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1571,9 +1573,7 @@ SearchCatCacheMiss(CatCache *cache, 
CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1684,9 +1684,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1743,9 +1741,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2253,3 +2249,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats and fills in most of the + * result. The classification here is based on the same criteria as + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed within a session, so fill this in every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding multiple in ageclass + * of cache_prune_min_age.
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns statistics for the specified syscache + * + * This routine returns the address of its local static memory.
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 7415c4faab..6b0fdbbd87 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -629,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1240,6 +1243,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 134c357bf3..e8d7b6998a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3154,6 +3154,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index d82af3bd6c..4a6c9fceb5 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -554,6 +554,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 3ecc2e12c3..11fc1f3075 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname =>
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 313ca5f3c3..ee9968f81a 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1353,5 +1355,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); +extern void pgstat_track_syscache_assign_hook(int newval, void *extra); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5d24809900..4d51975920 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize 
(USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples falling into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index e384cd2279..1991e75e97 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1919,6 +1919,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR
(pg_stat_all_tables.schemaname ~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2350,7 +2372,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From 4434b92429d9b60baed6f45bf8132a67225b0671 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 18 Jan 2019 17:16:12 +0900 Subject: [PATCH 3/3] Remote GUC setting feature and non-xact GUC config. This adds two features at once (they will be split later). One is a non-transactional GUC setting feature. It allows a GUC variable set with the action GUC_ACTION_NONXACT (the name requires consideration) to survive beyond a rollback.
It is required for remote GUC setting to work sanely. Without this feature, a value set remotely within a transaction would disappear when that transaction is rolled back. The only local interface for the NONXACT action is set_config(name, value, is_local=false, is_nonxact = true). The second feature is remote GUC setting. It uses ProcSignal to notify the target server. --- doc/src/sgml/config.sgml | 4 + doc/src/sgml/func.sgml | 30 ++ src/backend/catalog/system_views.sql | 7 +- src/backend/postmaster/pgstat.c | 3 + src/backend/storage/ipc/ipci.c | 2 + src/backend/storage/ipc/procsignal.c | 4 + src/backend/tcop/postgres.c | 10 + src/backend/utils/misc/README | 26 +- src/backend/utils/misc/guc.c | 619 +++++++++++++++++++++++++++++++++-- src/include/catalog/pg_proc.dat | 10 +- src/include/pgstat.h | 3 +- src/include/storage/procsignal.h | 3 + src/include/utils/guc.h | 13 +- src/include/utils/guc_tables.h | 5 +- src/test/regress/expected/guc.out | 223 +++++++++++++ src/test/regress/sql/guc.sql | 88 +++++ 16 files changed, 1002 insertions(+), 48 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6dd024340b..d024d9b069 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -281,6 +281,10 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter </listitem> </itemizedlist> + <para> + Values in other sessions can also be set using the SQL + function <function>pg_set_backend_config</function>.
+ </para> </sect2> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 4930ec17f6..aeb0c4483a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -18687,6 +18687,20 @@ SELECT collation for ('foo' COLLATE "de_DE"); <entry><type>text</type></entry> <entry>set parameter and return new value</entry> </row> + <row> + <entry> + <indexterm> + <primary>pg_set_backend_config</primary> + </indexterm> + <literal><function>pg_set_backend_config( + <parameter>process_id</parameter>, + <parameter>setting_name</parameter>, + <parameter>new_value</parameter>) + </function></literal> + </entry> + <entry><type>bool</type></entry> + <entry>set parameter in another session</entry> + </row> </tbody> </tgroup> </table> @@ -18741,6 +18755,22 @@ SELECT set_config('log_statement_stats', 'off', false); ------------ off (1 row) +</programlisting> + </para> + + <para> + <function>pg_set_backend_config</function> sets the parameter + <parameter>setting_name</parameter> to + <parameter>new_value</parameter> in the other session with PID + <parameter>process_id</parameter>. The setting is always session-local and + returns true on success.
An example: +<programlisting> +SELECT pg_set_backend_config(2134, 'work_mem', '16MB'); + +pg_set_backend_config +------------ + t +(1 row) </programlisting> </para> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 30e2da935a..3d2e341c19 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -474,7 +474,7 @@ CREATE VIEW pg_settings AS CREATE RULE pg_settings_u AS ON UPDATE TO pg_settings WHERE new.name = old.name DO - SELECT set_config(old.name, new.setting, 'f'); + SELECT set_config(old.name, new.setting, 'f', 'f'); CREATE RULE pg_settings_n AS ON UPDATE TO pg_settings @@ -1049,6 +1049,11 @@ CREATE OR REPLACE FUNCTION RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote' PARALLEL SAFE; +CREATE OR REPLACE FUNCTION set_config ( + setting_name text, new_value text, is_local boolean, is_nonxact boolean DEFAULT false) + RETURNS text STRICT VOLATILE LANGUAGE internal AS 'set_config_by_name' + PARALLEL UNSAFE; + -- legacy definition for compatibility with 9.3 CREATE OR REPLACE FUNCTION json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false) diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 2c0c6b343e..5d6c0edcd9 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3707,6 +3707,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_REMOTE_GUC: + event_name = "RemoteGUC"; + break; /* no default case, so that compiler will warn */ } diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 2849e47d99..044107b354 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -148,6 +148,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, BTreeShmemSize()); size = add_size(size, SyncScanShmemSize()); size = add_size(size, 
AsyncShmemSize()); + size = add_size(size, GucShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) BTreeShmemInit(); SyncScanShmemInit(); AsyncShmemInit(); + GucShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index 7605b2c367..98c0f84378 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -27,6 +27,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "utils/guc.h" /* @@ -292,6 +293,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); + if (CheckProcSignal(PROCSIG_REMOTE_GUC)) + HandleRemoteGucSetInterrupt(); + SetLatch(MyLatch); latch_sigusr1_handler(); diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index e7972e645f..3db2a7eacc 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3165,6 +3165,10 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + /* We don't want to change GUC variables while running a query */ + if (RemoteGucChangePending && DoingCommandRead) + HandleGucRemoteChanges(); } @@ -4201,6 +4205,12 @@ PostgresMain(int argc, char *argv[], send_ready_for_query = false; } + /* + * (2.5) Process some pending work. + */ + if (RemoteGucChangePending) + HandleGucRemoteChanges(); + /* * (2) Allow asynchronous signals to be executed immediately if they * come in while we are waiting for client input.
(This must be diff --git a/src/backend/utils/misc/README b/src/backend/utils/misc/README index 6e294386f7..42ae6c1a8f 100644 --- a/src/backend/utils/misc/README +++ b/src/backend/utils/misc/README @@ -169,10 +169,14 @@ Entry to a function with a SET option: Plain SET command: If no stack entry of current level: - Push new stack entry w/prior value and state SET + Push new stack entry w/prior value and state SET or + push new stack entry w/o value and state NONXACT. else if stack entry's state is SAVE, SET, or LOCAL: change stack state to SET, don't change saved value (here we are forgetting effects of prior set action) + else if stack entry's state is NONXACT: + change stack state to NONXACT_SET, set the current value to + prior. else (entry must have state SET+LOCAL): discard its masked value, change state to SET (here we are forgetting effects of prior SET and SET LOCAL) @@ -185,13 +189,20 @@ SET LOCAL command: else if stack entry's state is SAVE or LOCAL or SET+LOCAL: no change to stack entry (in SAVE case, SET LOCAL will be forgotten at func exit) + else if stack entry's state is NONXACT: + set current value to both prior and masked slots. set state + NONXACT+LOCAL. else (entry must have state SET): put current active into its masked slot, set state SET+LOCAL Now set new value. +Setting by NONXACT action (no command exists): + Always blow away existing stack then create a new NONXACT entry. + Transaction or subtransaction abort: - Pop stack entries, restoring prior value, until top < subxact depth + Pop stack entries, restoring prior value unless the stack entry's + state is NONXACT, until top < subxact depth Transaction or subtransaction commit (incl. successful function exit): @@ -199,9 +210,9 @@ Transaction or subtransaction commit (incl. 
successful function exit): if entry's state is SAVE: pop, restoring prior value - else if level is 1 and entry's state is SET+LOCAL: + else if level is 1 and entry's state is SET+LOCAL or NONXACT+LOCAL: pop, restoring *masked* value - else if level is 1 and entry's state is SET: + else if level is 1 and entry's state is SET or NONXACT+SET: pop, discarding old value else if level is 1 and entry's state is LOCAL: pop, restoring prior value @@ -210,9 +221,9 @@ Transaction or subtransaction commit (incl. successful function exit): else merge entries of level N-1 and N as specified below -The merged entry will have level N-1 and prior = older prior, so easiest -to keep older entry and free newer. There are 12 possibilities since -we already handled level N state = SAVE: +The merged entry will have level N-1 and prior = older prior, so +easiest to keep older entry and free newer. Disregarding NONXACT, +there are 12 possibilities since we already handled level N state = SAVE: N-1 N @@ -232,6 +243,7 @@ SET+LOCAL SET discard top prior and second masked, state SET SET+LOCAL LOCAL discard top prior, no change to stack entry SET+LOCAL SET+LOCAL discard top prior, copy masked, state S+L +(TODO: states involving NONXACT) RESET is executed like a SET, but using the reset_val as the desired new value.
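In short, the NONXACT rules in the README hunks above mean that a non-transactional set throws away the variable's undo information, so a later rollback leaves the value in place. A rough standalone C model of just that property (ModelGucVar and friends are illustrative names invented here, not the patch's GucStack machinery, and bounds are unchecked for brevity):

```c
/* Minimal model of GUC_ACTION_NONXACT semantics: a NONXACT set discards
 * the variable's undo stack, so rollback finds nothing to restore. */
#include <assert.h>
#include <string.h>

#define STACK_MAX 8
#define VAL_LEN   32

typedef struct
{
    char value[VAL_LEN];            /* current active value */
    char prior[STACK_MAX][VAL_LEN]; /* saved values for rollback */
    int  depth;                     /* number of saved values */
} ModelGucVar;

static void
model_set(ModelGucVar *v, const char *value, int nonxact)
{
    if (nonxact)
        v->depth = 0;               /* NONXACT blows away the undo stack */
    else
        strcpy(v->prior[v->depth++], v->value);
    strcpy(v->value, value);
}

static void
model_abort(ModelGucVar *v)
{
    /* rollback restores saved values; after a NONXACT set there are none */
    while (v->depth > 0)
        strcpy(v->value, v->prior[--v->depth]);
}
```

Under this model a plain SET followed by an abort restores the old value, while a NONXACT set followed by an abort keeps the new one, which is the behavior the remote-GUC feature relies on.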
(We do not provide a RESET LOCAL command, but SET LOCAL TO DEFAULT diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index e8d7b6998a..5a4eaed622 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -217,6 +217,37 @@ static ConfigVariable *ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel); +/* Enum and struct to command GUC setting to another backend */ +typedef enum +{ + REMGUC_VACANT, + REMGUC_REQUEST, + REMGUC_INPROCESS, + REMGUC_DONE, + REMGUC_CANCELING, + REMGUC_CANCELED, +} remote_guc_status; + +#define GUC_REMOTE_MAX_VALUE_LEN 1024 /* an arbitrary value */ +#define GUC_REMOTE_CANCEL_TIMEOUT 5000 /* in milliseconds */ + +typedef struct +{ + remote_guc_status state; + char name[NAMEDATALEN]; + char value[GUC_REMOTE_MAX_VALUE_LEN]; + int sourcepid; + int targetpid; + Oid userid; + bool success; + volatile Latch *sender_latch; + LWLock lock; +} GucRemoteSetting; + +static GucRemoteSetting *remote_setting; + +volatile bool RemoteGucChangePending = false; + /* * Options for enum values defined in this module. * @@ -3161,7 +3192,7 @@ static struct config_int ConfigureNamesInt[] = }, &pgstat_track_syscache_usage_interval, 0, 0, INT_MAX / 2, - NULL, NULL, NULL + NULL, &pgstat_track_syscache_assign_hook, NULL }, { @@ -4730,7 +4761,6 @@ discard_stack_value(struct config_generic *gconf, config_var_value *val) set_extra_field(gconf, &(val->extra), NULL); } - /* * Fetch the sorted array pointer (exported for help_config.c's use ONLY) */ @@ -5522,6 +5552,22 @@ push_old_value(struct config_generic *gconf, GucAction action) /* Do we already have a stack entry of the current nest level? 
*/ stack = gconf->stack; + + /* A NONXACT action makes the existing stack useless */ + if (action == GUC_ACTION_NONXACT) + { + while (stack) + { + GucStack *prev = stack->prev; + + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + pfree(stack); + stack = prev; + } + stack = gconf->stack = NULL; + } + if (stack && stack->nest_level >= GUCNestLevel) { /* Yes, so adjust its state if necessary */ @@ -5529,28 +5575,63 @@ push_old_value(struct config_generic *gconf, GucAction action) switch (action) { case GUC_ACTION_SET: - /* SET overrides any prior action at same nest level */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_NONXACT) { - /* must discard old masked value */ - discard_stack_value(gconf, &stack->masked); + /* NONXACT rolls back to the current value */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->state = GUC_NONXACT_SET; } - stack->state = GUC_SET; + else + { + /* SET overrides other prior actions at same nest level */ + if (stack->state == GUC_SET_LOCAL) + { + /* must discard old masked value */ + discard_stack_value(gconf, &stack->masked); + } + stack->state = GUC_SET; + } + break; + case GUC_ACTION_LOCAL: if (stack->state == GUC_SET) { - /* SET followed by SET LOCAL, remember SET's value */ + /* SET followed by SET LOCAL, remember its value */ stack->masked_scontext = gconf->scontext; set_stack_value(gconf, &stack->masked); stack->state = GUC_SET_LOCAL; } + else if (stack->state == GUC_NONXACT) + { + /* + * NONXACT followed by SET LOCAL, both prior and masked + * are set to the current value + */ + stack->scontext = gconf->scontext; + set_stack_value(gconf, &stack->prior); + stack->masked_scontext = stack->scontext; + stack->masked = stack->prior; + stack->state = GUC_NONXACT_LOCAL; + } + else if (stack->state == GUC_NONXACT_SET) + { + /* NONXACT_SET followed by SET LOCAL, set masked */ + stack->masked_scontext = gconf->scontext; + set_stack_value(gconf,
&stack->masked); + stack->state = GUC_NONXACT_LOCAL; + } /* in all other cases, no change to stack entry */ break; case GUC_ACTION_SAVE: /* Could only have a prior SAVE of same variable */ Assert(stack->state == GUC_SAVE); break; + + case GUC_ACTION_NONXACT: + Assert(false); + break; } Assert(guc_dirty); /* must be set already */ return; @@ -5566,6 +5647,7 @@ push_old_value(struct config_generic *gconf, GucAction action) stack->prev = gconf->stack; stack->nest_level = GUCNestLevel; + switch (action) { case GUC_ACTION_SET: @@ -5577,10 +5659,15 @@ push_old_value(struct config_generic *gconf, GucAction action) case GUC_ACTION_SAVE: stack->state = GUC_SAVE; break; + case GUC_ACTION_NONXACT: + stack->state = GUC_NONXACT; + break; } stack->source = gconf->source; stack->scontext = gconf->scontext; - set_stack_value(gconf, &stack->prior); + + if (action != GUC_ACTION_NONXACT) + set_stack_value(gconf, &stack->prior); gconf->stack = stack; @@ -5675,22 +5762,31 @@ AtEOXact_GUC(bool isCommit, int nestLevel) * stack entries to avoid leaking memory. If we do set one of * those flags, unused fields will be cleaned up after restoring. 
*/ - if (!isCommit) /* if abort, always restore prior value */ - restorePrior = true; + if (!isCommit) + { + /* GUC_NONXACT doesn't roll back */ + if (stack->state != GUC_NONXACT) + restorePrior = true; + } else if (stack->state == GUC_SAVE) restorePrior = true; else if (stack->nest_level == 1) { /* transaction commit */ - if (stack->state == GUC_SET_LOCAL) + if (stack->state == GUC_SET_LOCAL || + stack->state == GUC_NONXACT_LOCAL) restoreMasked = true; - else if (stack->state == GUC_SET) + else if (stack->state == GUC_SET || + stack->state == GUC_NONXACT_SET) { /* we keep the current active value */ discard_stack_value(gconf, &stack->prior); } - else /* must be GUC_LOCAL */ + else if (stack->state != GUC_NONXACT) + { + /* must be GUC_LOCAL */ restorePrior = true; + } } else if (prev == NULL || prev->nest_level < stack->nest_level - 1) @@ -5712,11 +5808,27 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET: - /* next level always becomes SET */ - discard_stack_value(gconf, &stack->prior); - if (prev->state == GUC_SET_LOCAL) + if (prev->state == GUC_SET || + prev->state == GUC_NONXACT_SET) + { + discard_stack_value(gconf, &stack->prior); + } + else if (prev->state == GUC_NONXACT) + { + prev->scontext = stack->scontext; + prev->prior = stack->prior; + prev->state = GUC_NONXACT_SET; + } + else if (prev->state == GUC_SET_LOCAL || + prev->state == GUC_NONXACT_LOCAL) + { + discard_stack_value(gconf, &stack->prior); discard_stack_value(gconf, &prev->masked); - prev->state = GUC_SET; + if (prev->state == GUC_SET_LOCAL) + prev->state = GUC_SET; + else + prev->state = GUC_NONXACT_SET; + } break; case GUC_LOCAL: @@ -5727,6 +5839,16 @@ AtEOXact_GUC(bool isCommit, int nestLevel) prev->masked = stack->prior; prev->state = GUC_SET_LOCAL; } + else if (prev->state == GUC_NONXACT) + { + prev->prior = stack->masked; + prev->scontext = stack->masked_scontext; + prev->masked = stack->masked; + prev->masked_scontext = stack->masked_scontext; + discard_stack_value(gconf,
&stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } else { /* else just forget this stack level */ @@ -5735,15 +5857,32 @@ AtEOXact_GUC(bool isCommit, int nestLevel) break; case GUC_SET_LOCAL: - /* prior state at this level no longer wanted */ - discard_stack_value(gconf, &stack->prior); - /* copy down the masked state */ - prev->masked_scontext = stack->masked_scontext; - if (prev->state == GUC_SET_LOCAL) - discard_stack_value(gconf, &prev->masked); - prev->masked = stack->masked; - prev->state = GUC_SET_LOCAL; + if (prev->state == GUC_NONXACT) + { + prev->prior = stack->prior; + prev->masked = stack->prior; + discard_stack_value(gconf, &stack->prior); + discard_stack_value(gconf, &stack->masked); + prev->state = GUC_NONXACT_SET; + } + else if (prev->state != GUC_NONXACT_SET) + { + /* prior state at this level no longer wanted */ + discard_stack_value(gconf, &stack->prior); + /* copy down the masked state */ + prev->masked_scontext = stack->masked_scontext; + if (prev->state == GUC_SET_LOCAL) + discard_stack_value(gconf, &prev->masked); + prev->masked = stack->masked; + prev->state = GUC_SET_LOCAL; + } break; + case GUC_NONXACT: + case GUC_NONXACT_SET: + case GUC_NONXACT_LOCAL: + Assert(false); + break; + } } @@ -8024,7 +8163,8 @@ set_config_by_name(PG_FUNCTION_ARGS) char *name; char *value; char *new_value; - bool is_local; + int set_action = GUC_ACTION_SET; + if (PG_ARGISNULL(0)) ereport(ERROR, @@ -8044,18 +8184,27 @@ set_config_by_name(PG_FUNCTION_ARGS) * Get the desired state of is_local. Default to false if provided value * is NULL */ - if (PG_ARGISNULL(2)) - is_local = false; - else - is_local = PG_GETARG_BOOL(2); + if (!PG_ARGISNULL(2) && PG_GETARG_BOOL(2)) + set_action = GUC_ACTION_LOCAL; + + /* + * Get the desired state of is_nonxact. 
Default to false if provided value + * is NULL + */ + if (!PG_ARGISNULL(3) && PG_GETARG_BOOL(3)) + { + if (set_action == GUC_ACTION_LOCAL) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("Only one of is_local and is_nonxact can be true"))); + set_action = GUC_ACTION_NONXACT; + } /* Note SET DEFAULT (argstring == NULL) is equivalent to RESET */ (void) set_config_option(name, value, (superuser() ? PGC_SUSET : PGC_USERSET), - PGC_S_SESSION, - is_local ? GUC_ACTION_LOCAL : GUC_ACTION_SET, - true, 0, false); + PGC_S_SESSION, set_action, true, 0, false); /* get the new current value */ new_value = GetConfigOptionByName(name, NULL, false); @@ -8064,7 +8213,6 @@ set_config_by_name(PG_FUNCTION_ARGS) PG_RETURN_TEXT_P(cstring_to_text(new_value)); } - /* * Common code for DefineCustomXXXVariable subroutines: allocate the * new variable's config struct and fill in generic fields. @@ -8263,6 +8411,13 @@ reapply_stacked_values(struct config_generic *variable, WARNING, false); break; + case GUC_NONXACT: + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_NONXACT, true, + WARNING, false); + break; + case GUC_LOCAL: (void) set_config_option(name, curvalue, curscontext, cursource, @@ -8282,6 +8437,33 @@ reapply_stacked_values(struct config_generic *variable, GUC_ACTION_LOCAL, true, WARNING, false); break; + + case GUC_NONXACT_SET: + /* first, apply the masked value as SET */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current value as LOCAL */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_SET, true, + WARNING, false); + break; + + case GUC_NONXACT_LOCAL: + /* first, apply the masked value as SET */ + (void) set_config_option(name, stack->masked.val.stringval, + stack->masked_scontext, PGC_S_SESSION, + GUC_ACTION_NONXACT, true, + WARNING, false); + /* then apply the current 
value as LOCAL */ + (void) set_config_option(name, curvalue, + curscontext, cursource, + GUC_ACTION_LOCAL, true, + WARNING, false); + break; + } /* If we successfully made a stack entry, adjust its nest level */ @@ -10260,6 +10442,373 @@ GUCArrayReset(ArrayType *array) return newarray; } +Size +GucShmemSize(void) +{ + Size size; + + size = sizeof(GucRemoteSetting); + + return size; +} + +void +GucShmemInit(void) +{ + Size size; + bool found; + + size = sizeof(GucRemoteSetting); + remote_setting = (GucRemoteSetting *) + ShmemInitStruct("GUC remote setting", size, &found); + + if (!found) + { + MemSet(remote_setting, 0, size); + LWLockInitialize(&remote_setting->lock, LWLockNewTrancheId()); + } + + LWLockRegisterTranche(remote_setting->lock.tranche, "guc_remote"); +} + +/* + * set_backend_config: SQL callable function to set GUC variable of remote + * session. + */ +Datum +set_backend_config(PG_FUNCTION_ARGS) +{ + int pid = PG_GETARG_INT32(0); + char *name = text_to_cstring(PG_GETARG_TEXT_P(1)); + char *value = text_to_cstring(PG_GETARG_TEXT_P(2)); + TimestampTz cancel_start; + PgBackendStatus *beentry; + int beid; + int rc; + + if (strlen(name) >= NAMEDATALEN) + ereport(ERROR, + (errcode(ERRCODE_NAME_TOO_LONG), + errmsg("name of GUC variable is too long"))); + if (strlen(value) >= GUC_REMOTE_MAX_VALUE_LEN) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("value is too long"), + errdetail("Maximum acceptable length of value is %d", + GUC_REMOTE_MAX_VALUE_LEN - 1))); + + /* find beentry for given pid */ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * This will be checked out by SendProcSignal but do here to emit + * appropriate message message. 
+ */ + if (!beentry) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("process PID %d not found", pid))); + + /* allow only client backends */ + if (beentry->st_backendType != B_BACKEND) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("not a client backend"))); + + /* + * Wait if someone is sending a request. We need to wait with timeout + * since the current user of the struct doesn't wake me up. + */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_VACANT) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 200, PG_WAIT_ACTIVITY); + + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + CHECK_FOR_INTERRUPTS(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + /* my turn, send a request */ + Assert(remote_setting->state == REMGUC_VACANT); + + remote_setting->state = REMGUC_REQUEST; + remote_setting->sourcepid = MyProcPid; + remote_setting->targetpid = pid; + remote_setting->userid = GetUserId(); + + strncpy(remote_setting->name, name, NAMEDATALEN); + remote_setting->name[NAMEDATALEN - 1] = 0; + strncpy(remote_setting->value, value, GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + remote_setting->sender_latch = MyLatch; + + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", pid))); + } + + /* + * This request is processed only while idle time of peer so it may take a + * long time before we get a response. 
+ */ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_DONE) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_POSTMASTER_DEATH, + -1, PG_WAIT_ACTIVITY); + + /* don't care of the state in the case.. */ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* get out if we got a query cancel request */ + if (QueryCancelPending) + break; + } + + /* + * Cancel the requset if possible. We cannot cancel the request in the + * case peer have processed it. We don't see QueryCancelPending but the + * request status so that the case is handled properly. + */ + if (remote_setting->state == REMGUC_REQUEST) + { + Assert(QueryCancelPending); + + remote_setting->state = REMGUC_CANCELING; + LWLockRelease(&remote_setting->lock); + + if (SendProcSignal(pid, + PROCSIG_REMOTE_GUC, InvalidBackendId) < 0) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, + (errmsg("could not signal backend with PID %d: %m", + pid))); + } + + /* Peer must respond shortly, don't sleep for a long time. */ + + cancel_start = GetCurrentTimestamp(); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + while (remote_setting->state != REMGUC_CANCELED && + !TimestampDifferenceExceeds(cancel_start, GetCurrentTimestamp(), + GUC_REMOTE_CANCEL_TIMEOUT)) + { + LWLockRelease(&remote_setting->lock); + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + GUC_REMOTE_CANCEL_TIMEOUT, PG_WAIT_ACTIVITY); + + /* don't care of the state in the case.. 
*/ + if (rc & WL_POSTMASTER_DEATH) + return (Datum) BoolGetDatum(false); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + } + + if (remote_setting->state != REMGUC_CANCELED) + { + remote_setting->state = REMGUC_VACANT; + ereport(ERROR, (errmsg("failed cancelling remote GUC request"))); + } + + remote_setting->state = REMGUC_VACANT; + LWLockRelease(&remote_setting->lock); + + ereport(INFO, + (errmsg("remote GUC change request to PID %d is canceled", + pid))); + + return (Datum) BoolGetDatum(false); + } + + Assert (remote_setting->state == REMGUC_DONE); + + /* ereport exits on query cancel, we need this before that */ + remote_setting->state = REMGUC_VACANT; + + if (QueryCancelPending) + ereport(INFO, + (errmsg("remote GUC change request to PID %d already completed", + pid))); + + if (!remote_setting->success) + ereport(ERROR, + (errmsg("%s", remote_setting->value))); + + LWLockRelease(&remote_setting->lock); + + return (Datum) BoolGetDatum(true); +} + + +void +HandleRemoteGucSetInterrupt(void) +{ + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* check if any request is being sent to me */ + if (remote_setting->targetpid == MyProcPid) + { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + InterruptPending = true; + RemoteGucChangePending = true; + break; + case REMGUC_CANCELING: + InterruptPending = true; + RemoteGucChangePending = true; + remote_setting->state = REMGUC_CANCELED; + SetLatch(remote_setting->sender_latch); + break; + default: + break; + } + } + LWLockRelease(&remote_setting->lock); +} + +void +HandleGucRemoteChanges(void) +{ + MemoryContext currentcxt = CurrentMemoryContext; + bool canceling = false; + bool process_request = true; + int saveInterruptHoldoffCount = 0; + int saveQueryCancelHoldoffCount = 0; + + RemoteGucChangePending = false; + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + + /* skip if this request is no longer for me */ + if (remote_setting->targetpid != MyProcPid) + process_request = false; + else 
+ { + switch (remote_setting->state) + { + case REMGUC_REQUEST: + remote_setting->state = REMGUC_INPROCESS; + break; + case REMGUC_CANCELING: + /* + * This request is already canceled but entered this function + * before receiving signal. Cancel the request here. + */ + remote_setting->state = REMGUC_CANCELED; + remote_setting->success = false; + canceling = true; + break; + case REMGUC_VACANT: + case REMGUC_CANCELED: + case REMGUC_INPROCESS: + case REMGUC_DONE: + /* Just ignore the cases */ + process_request = false; + break; + } + } + + LWLockRelease(&remote_setting->lock); + + if (!process_request) + return; + + if (canceling) + { + SetLatch(remote_setting->sender_latch); + return; + } + + + /* Okay, actually modify variable */ + remote_setting->success = true; + + PG_TRY(); + { + bool has_privilege; + bool is_superuser; + bool end_transaction = false; + /* + * XXXX: ERROR resets the following varialbes but we don't want that. + */ + saveInterruptHoldoffCount = InterruptHoldoffCount; + saveQueryCancelHoldoffCount = QueryCancelHoldoffCount; + + /* superuser_arg requires a transaction */ + if (!IsTransactionState()) + { + StartTransactionCommand(); + end_transaction = true; + } + is_superuser = superuser_arg(remote_setting->userid); + has_privilege = is_superuser || + has_privs_of_role(remote_setting->userid, GetUserId()); + + if (end_transaction) + CommitTransactionCommand(); + + if (!has_privilege) + elog(ERROR, "role %u is not allowed to set GUC variables on the session with PID %d", + remote_setting->userid, MyProcPid); + + (void) set_config_option(remote_setting->name, remote_setting->value, + is_superuser ? 
PGC_SUSET : PGC_USERSET, + PGC_S_SESSION, GUC_ACTION_NONXACT, + true, ERROR, false); + } + PG_CATCH(); + { + ErrorData *errdata; + MemoryContextSwitchTo(currentcxt); + errdata = CopyErrorData(); + remote_setting->success = false; + strncpy(remote_setting->value, errdata->message, + GUC_REMOTE_MAX_VALUE_LEN); + remote_setting->value[GUC_REMOTE_MAX_VALUE_LEN - 1] = 0; + FlushErrorState(); + + /* restore the saved value */ + InterruptHoldoffCount = saveInterruptHoldoffCount ; + QueryCancelHoldoffCount = saveQueryCancelHoldoffCount; + + } + PG_END_TRY(); + + ereport(LOG, + (errmsg("GUC variable \"%s\" is changed to \"%s\" by request from another backend with PID %d", + remote_setting->name, remote_setting->value, + remote_setting->sourcepid))); + + LWLockAcquire(&remote_setting->lock, LW_EXCLUSIVE); + remote_setting->state = REMGUC_DONE; + LWLockRelease(&remote_setting->lock); + + SetLatch(remote_setting->sender_latch); +} + /* * Validate a proposed option setting for GUCArrayAdd/Delete/Reset. 
* diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 11fc1f3075..54d0c3917e 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5700,8 +5700,8 @@ proargtypes => 'text bool', prosrc => 'show_config_by_name_missing_ok' }, { oid => '2078', descr => 'SET X as a function', proname => 'set_config', proisstrict => 'f', provolatile => 'v', - proparallel => 'u', prorettype => 'text', proargtypes => 'text text bool', - prosrc => 'set_config_by_name' }, + proparallel => 'u', prorettype => 'text', + proargtypes => 'text text bool bool', prosrc => 'set_config_by_name' }, { oid => '2084', descr => 'SHOW ALL as a function', proname => 'pg_show_all_settings', prorows => '1000', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', @@ -9678,6 +9678,12 @@ proargmodes => '{i,o,o,o,o,o,o,o,o,o}', proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', prosrc => 'pgstat_get_syscache_stats' }, +{ oid => '3424', + descr => 'set config of another backend', + proname => 'pg_set_backend_config', proisstrict => 'f', + proretset => 'f', provolatile => 'v', proparallel => 'u', + prorettype => 'bool', proargtypes => 'int4 text text', + prosrc => 'set_backend_config' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/pgstat.h b/src/include/pgstat.h index ee9968f81a..70b926a8d1 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -833,7 +833,8 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_REMOTE_GUC } WaitEventIPC; /* ---------- diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index 9f2f965d5c..040877f5eb 100644 --- 
a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -42,6 +42,9 @@ typedef enum PROCSIG_RECOVERY_CONFLICT_BUFFERPIN, PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK, + /* Remote GUC setting */ + PROCSIG_REMOTE_GUC, + NUM_PROCSIGNALS /* Must be last! */ } ProcSignalReason; diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h index c07e7b945e..1e12773906 100644 --- a/src/include/utils/guc.h +++ b/src/include/utils/guc.h @@ -193,7 +193,8 @@ typedef enum /* Types of set_config_option actions */ GUC_ACTION_SET, /* regular SET command */ GUC_ACTION_LOCAL, /* SET LOCAL command */ - GUC_ACTION_SAVE /* function SET option, or temp assignment */ + GUC_ACTION_SAVE, /* function SET option, or temp assignment */ + GUC_ACTION_NONXACT /* transactional setting */ } GucAction; #define GUC_QUALIFIER_SEPARATOR '.' @@ -269,6 +270,8 @@ extern int tcp_keepalives_idle; extern int tcp_keepalives_interval; extern int tcp_keepalives_count; +extern volatile bool RemoteGucChangePending; + #ifdef TRACE_SORT extern bool trace_sort; #endif @@ -276,6 +279,11 @@ extern bool trace_sort; /* * Functions exported by guc.c */ +extern Size GucShmemSize(void); +extern void GucShmemInit(void); +extern Datum set_backend_setting(PG_FUNCTION_ARGS); +extern void HandleRemoteGucSetInterrupt(void); +extern void HandleGucRemoteChanges(void); extern void SetConfigOption(const char *name, const char *value, GucContext context, GucSource source); @@ -395,6 +403,9 @@ extern Size EstimateGUCStateSpace(void); extern void SerializeGUCState(Size maxsize, char *start_address); extern void RestoreGUCState(void *gucstate); +/* Remote GUC setting */ +extern void HandleGucRemoteChanges(void); + /* Support for messages reported from GUC check hooks */ extern PGDLLIMPORT char *GUC_check_errmsg_string; diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h index a0970b2e1c..c00520e90c 100644 --- a/src/include/utils/guc_tables.h +++ b/src/include/utils/guc_tables.h @@ -115,7 
+115,10 @@ typedef enum GUC_SAVE, /* entry caused by function SET option */ GUC_SET, /* entry caused by plain SET command */ GUC_LOCAL, /* entry caused by SET LOCAL command */ - GUC_SET_LOCAL /* entry caused by SET then SET LOCAL */ + GUC_NONXACT, /* entry caused by non-transactional ops */ + GUC_SET_LOCAL, /* entry caused by SET then SET LOCAL */ + GUC_NONXACT_SET, /* entry caused by NONXACT then SET */ + GUC_NONXACT_LOCAL /* entry caused by NONXACT then (SET)LOCAL */ } GucStackState; typedef struct guc_stack diff --git a/src/test/regress/expected/guc.out b/src/test/regress/expected/guc.out index b0d7351145..2d19697a8c 100644 --- a/src/test/regress/expected/guc.out +++ b/src/test/regress/expected/guc.out @@ -476,6 +476,229 @@ SELECT '2006-08-13 12:34:56'::timestamptz; 2006-08-13 12:34:56-07 (1 row) +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 256kB + work_mem +---------- + 256kB +(1 row) + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB + work_mem +---------- + 512kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +COMMIT; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- SET, 
SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT + set_config +------------ + 128kB +(1 row) + +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- must see 128kB + work_mem +---------- + 128kB +(1 row) + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB + work_mem +---------- + 384kB +(1 row) + +ROLLBACK; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- 
NONXACT + set_config +------------ + 256kB +(1 row) + +SHOW work_mem; + work_mem +---------- + 256kB +(1 row) + +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +COMMIT; +SHOW work_mem; -- will see 256kB + work_mem +---------- + 256kB +(1 row) + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. diff --git a/src/test/regress/sql/guc.sql b/src/test/regress/sql/guc.sql index 3b854ac496..bbb91aaa98 100644 --- a/src/test/regress/sql/guc.sql +++ b/src/test/regress/sql/guc.sql @@ -133,6 +133,94 @@ SHOW vacuum_cost_delay; SHOW datestyle; SELECT '2006-08-13 12:34:56'::timestamptz; +-- NONXACT followed by SET, SET LOCAL through COMMIT +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +COMMIT; +SHOW work_mem; -- must see 256kB + +-- NONXACT followed by SET, SET LOCAL through ROLLBACK +BEGIN; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SHOW work_mem; -- must see 512kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through COMMIT +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +COMMIT; +SHOW work_mem; -- must see 128kB + +-- SET, SET LOCAL followed by NONXACT through ROLLBACK +BEGIN; +SET work_mem to '256kB'; +SET LOCAL work_mem to '512kB'; +SELECT set_config('work_mem', '128kB', false, true); -- NONXACT +SHOW work_mem; -- must see 128kB +ROLLBACK; +SHOW work_mem; -- must see 128kB + +-- NONXACT and SAVEPOINT +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, 
true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +COMMIT; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +RELEASE SAVEPOINT a; +SHOW work_mem; -- will see 384kB +ROLLBACK; +SHOW work_mem; -- will see 256kB +-- +SET work_mem TO '64kB'; +BEGIN; +SET work_mem TO '128kB'; +SET LOCAL work_mem TO '384kB'; +SAVEPOINT a; +SELECT set_config('work_mem', '256kB', false, true); -- NONXACT +SHOW work_mem; +SET LOCAL work_mem TO '384kB'; +ROLLBACK TO SAVEPOINT a; +SHOW work_mem; -- will see 256kB +COMMIT; +SHOW work_mem; -- will see 256kB + +SET work_mem TO DEFAULT; -- -- Test RESET. We use datestyle because the reset value is forced by -- pg_regress, so it doesn't depend on the installation's configuration. -- 2.16.3
On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote: > Well, I think everyone agrees there are workloads that cause undesired > cache bloat. What we have not found is a solution that doesn't cause > code complexity or undesired overhead, or one that >1% of users will > know how to use. > > Unfortunately, because we have not found something we are happy with, we > have done nothing. I agree LRU can be expensive. What if we do some > kind of clock sweep and expiration like we do for shared buffers? I > think the trick is figuring how frequently to do the sweep. What if we > mark entries as unused every 10 queries, mark them as used on first use, > and delete cache entries that have not be used in the past 10 queries. I still think wall-clock time is a perfectly reasonable heuristic. Say every 5 or 10 minutes you walk through the cache. Anything that hasn't been touched since the last scan you throw away. If you do this, you MIGHT flush an entry that you're just about to need again, but (1) it's not very likely, because if it hasn't been touched in many minutes, the chances that it's about to be needed again are low, and (2) even if it does happen, it probably won't cost all that much, because *occasionally* reloading a cache entry unnecessarily isn't that costly; the big problem is when you do it over and over again, which can easily happen with a fixed size limit on the cache, and (3) if somebody does have a workload where they touch the same object every 11 minutes, we can give them a GUC to control the timeout between cache sweeps and it's really not that hard to understand how to set it. And most people won't need to. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote: >> Unfortunately, because we have not found something we are happy with, we >> have done nothing. I agree LRU can be expensive. What if we do some >> kind of clock sweep and expiration like we do for shared buffers? I >> think the trick is figuring how frequently to do the sweep. What if we >> mark entries as unused every 10 queries, mark them as used on first use, >> and delete cache entries that have not be used in the past 10 queries. > I still think wall-clock time is a perfectly reasonable heuristic. The easy implementations of that involve putting gettimeofday() calls into hot code paths, which would be a Bad Thing. But maybe we could do this only at transaction or statement start, and piggyback on the gettimeofday() calls that already happen at those times. regards, tom lane
On 2019-01-18 15:57:17 -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Thu, Jan 17, 2019 at 2:48 PM Bruce Momjian <bruce@momjian.us> wrote: > >> Unfortunately, because we have not found something we are happy with, we >> have done nothing. I agree LRU can be expensive. What if we do some >> kind of clock sweep and expiration like we do for shared buffers? I >> think the trick is figuring how frequently to do the sweep. What if we >> mark entries as unused every 10 queries, mark them as used on first use, >> and delete cache entries that have not be used in the past 10 queries. > > > I still think wall-clock time is a perfectly reasonable heuristic. > > The easy implementations of that involve putting gettimeofday() calls > into hot code paths, which would be a Bad Thing. But maybe we could > do this only at transaction or statement start, and piggyback on the > gettimeofday() calls that already happen at those times. My proposal for this was to attach a 'generation' to cache entries. Upon access cache entries are marked to be of the current generation. Whenever existing memory isn't sufficient for further cache entries and, on a less frequent schedule, triggered by a timer, the cache generation is increased and the new generation's "creation time" is measured. Then generations that are older than a certain threshold are purged, and if there are any, the entries of the purged generation are removed from the caches using a sequential scan through the cache. This outline achieves: - no additional time measurements in hot code paths - no need for a sequential scan of the entire cache when no generations are too old - both size and time limits can be implemented reasonably cheaply - overhead when feature disabled should be close to zero Greetings, Andres Freund
On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote: > My proposal for this was to attach a 'generation' to cache entries. Upon > access cache entries are marked to be of the current > generation. Whenever existing memory isn't sufficient for further cache > entries and, on a less frequent schedule, triggered by a timer, the > cache generation is increased and th new generation's "creation time" is > measured. Then generations that are older than a certain threshold are > purged, and if there are any, the entries of the purged generation are > removed from the caches using a sequential scan through the cache. > > This outline achieves: > - no additional time measurements in hot code paths > - no need for a sequential scan of the entire cache when no generations > are too old > - both size and time limits can be implemented reasonably cheaply > - overhead when feature disabled should be close to zero Seems generally reasonable. The "whenever existing memory isn't sufficient for further cache entries" part I'm not sure about. Couldn't that trigger very frequently and prevent necessary cache size growth? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-01-18 19:57:03 -0500, Robert Haas wrote: > On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote: > > My proposal for this was to attach a 'generation' to cache entries. Upon > > access cache entries are marked to be of the current > > generation. Whenever existing memory isn't sufficient for further cache > > entries and, on a less frequent schedule, triggered by a timer, the > > cache generation is increased and th new generation's "creation time" is > > measured. Then generations that are older than a certain threshold are > > purged, and if there are any, the entries of the purged generation are > > removed from the caches using a sequential scan through the cache. > > > > This outline achieves: > > - no additional time measurements in hot code paths > > - no need for a sequential scan of the entire cache when no generations > > are too old > > - both size and time limits can be implemented reasonably cheaply > > - overhead when feature disabled should be close to zero > > Seems generally reasonable. The "whenever existing memory isn't > sufficient for further cache entries" part I'm not sure about. > Couldn't that trigger very frequently and prevent necessary cache size > growth? I'm thinking it'd just trigger a new generation, with its associated "creation" time (which is cheap to acquire in comparison to creating a number of cache entries). Depending on settings or just code policy we can decide up to which generation to prune the cache, using that creation time. I'd imagine that we'd have some default cache-pruning time in the minutes, and for workloads where relevant one can make sizing configurations more aggressive - or something like that. Greetings, Andres Freund
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > 0003: Remote GUC setting > > It is independent from the above two, and heavily arguable. > > pg_set_backend_config(pid, name, value) changes the GUC <name> on the > backend with <pid> to <value>. > Not having looked at the code yet, why do you think this is necessary? Can't we always collect the cache stats? Is it heavy due to some locking in the shared memory, or sending the stats to the stats collector? Regards Takayuki Tsunakawa
Hello. At Fri, 18 Jan 2019 17:09:41 -0800, "andres@anarazel.de" <andres@anarazel.de> wrote in <20190119010941.6ruftewah7t3k3yk@alap3.anarazel.de> > Hi, > > On 2019-01-18 19:57:03 -0500, Robert Haas wrote: > > On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote: > > > My proposal for this was to attach a 'generation' to cache entries. Upon > > > access cache entries are marked to be of the current > > > generation. Whenever existing memory isn't sufficient for further cache > > > entries and, on a less frequent schedule, triggered by a timer, the > > > cache generation is increased and th new generation's "creation time" is > > > measured. Then generations that are older than a certain threshold are > > > purged, and if there are any, the entries of the purged generation are > > > removed from the caches using a sequential scan through the cache. > > > > > > This outline achieves: > > > - no additional time measurements in hot code paths It is taken at every transaction start and stored in TimestampTz in this patch. No additional time measurement is added, but cache pruning won't happen if a transaction lives for a long time. A time-driven generation value, maybe with a 10s-1min fixed interval, is a possible option. > > > - no need for a sequential scan of the entire cache when no generations > > > are too old This patch didn't precheck against the oldest generation, but it can be easily calculated. (But it is based on the last-access time, not the creation time.) (Attached applies over the v7-0001-Remove-entries-..patch) Using generation time, entries are purged even if they were recently accessed. I think the last-accessed time is more suitable for the purpose. On the other hand, using last-accessed time, the oldest generation can be stale by later access.
> > > - both size and time limits can be implemented reasonably cheaply > > > - overhead when feature disabled should be close to zero Overhead when disabled is already zero, since scanning is inhibited when cache_prune_min_age is a negative value. > > Seems generally reasonable. The "whenever existing memory isn't > > sufficient for further cache entries" part I'm not sure about. > > Couldn't that trigger very frequently and prevent necessary cache size > > growth? > > I'm thinking it'd just trigger a new generation, with it's associated > "creation" time (which is cheap to acquire in comparison to creating a > number of cache entries) . Depending on settings or just code policy we > can decide up to which generation to prune the cache, using that > creation time. I'd imagine that we'd have some default cache-pruning > time in the minutes, and for workloads where relevant one can make > sizing configurations more aggressive - or something like that. The current patch uses the last-accessed time, obtained without an extra gettimeofday() call. The generation count is fixed at 3, and infrequently-accessed entries are removed sooner. The generation interval is determined by cache_prune_min_age. Although this doesn't put a hard cap on memory usage, it is indirectly and softly limited by cache_prune_min_age and cache_memory_target, which determine how large a cache can grow before pruning happens. They apply on a per-cache basis. If we prefer to set a budget on all the syscaches (or even including other caches), it would be more complex. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 4a3b3094a0..8274704af7 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -859,6 +859,7 @@ InitCatCache(int id, for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; cp->cc_tupsize = 0; + cp->cc_noprune_until = 0; /* * new cache is initialized as far as we can go for now. 
print some @@ -898,6 +899,7 @@ CatCacheCleanupOldEntries(CatCache *cp) int i; int nremoved = 0; size_t hash_size; + TimestampTz oldest_lastaccess = 0; #ifdef CATCACHE_STATS /* These variables are only for debugging purpose */ int ntotal = 0; @@ -918,6 +920,10 @@ CatCacheCleanupOldEntries(CatCache *cp) if (cache_prune_min_age < 0) return false; + /* Return immediately if apparently no entry to remove */ + if (cp->cc_noprune_until == 0 || catcacheclock <= cp->cc_noprune_until) + return false; + /* * Return without pruning if the size of the hash is below the target. */ @@ -939,6 +945,7 @@ CatCacheCleanupOldEntries(CatCache *cp) CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); long entry_age; int us; + bool removed = false; /* @@ -982,12 +989,24 @@ CatCacheCleanupOldEntries(CatCache *cp) { CatCacheRemoveCTup(cp, ct); nremoved++; + removed = true; } } } + + /* Take the oldest lastaccess among survived entries */ + if (!removed && + (oldest_lastaccess == 0 || ct->lastaccess < oldest_lastaccess)) + oldest_lastaccess = ct->lastaccess; } } + /* Calculate the next pruning time if any entry remains */ + if (oldest_lastaccess > 0) + oldest_lastaccess += cache_prune_min_age * USECS_PER_SEC; + + cp->cc_noprune_until = oldest_lastaccess; + #ifdef CATCACHE_STATS StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, "number of syscache age class must be 6"); @@ -1423,6 +1442,11 @@ SearchCatCacheInternal(CatCache *cache, ct->naccess++; ct->lastaccess = catcacheclock; + /* the first entry determines the next pruning time */ + if (cache_prune_min_age >= 0 && cache->cc_noprune_until == 0) + cache->cc_noprune_until = + ct->lastaccess + cache_prune_min_age * USECS_PER_SEC; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 4d51975920..1750919399 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -63,7 +63,8 @@ typedef struct catcache ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ int cc_tupsize; /* total amount of catcache tuples */ - + TimestampTz cc_noprune_until; /* Skip pruning until this time has passed + * zero means no entry lives in this cache */ /* * Statistics entries */
Thank you for pointing out the stupidity. (Tom did earlier, though.) At Mon, 21 Jan 2019 07:12:41 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB6C78A@G01JPEXMBYT05> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > > 0003: Remote GUC setting > > > > It is independent from the above two, and heavily arguable. > > > > pg_set_backend_config(pid, name, value) changes the GUC <name> on the > > backend with <pid> to <value>. > > > > Not having looked at the code yet, why did you think this is necessary? Can't we always collect the cache stats? Is it heavy due to some locking in the shared memory, or sending the stats to the stats collector? Yeah, I had fun making it, but I can't say it's very good. I must admit that it is a kind of too-much, or something stupid. Anyway, it needs to scan the whole hash to collect the numbers, and I don't see how to eliminate that cost without a penalty on regular code paths for now. That is why I don't want to do it unconditionally. An option is an additional PGPROC member and interface functions. struct PGPROC { ... int syscache_usage_track_interval; /* track interval, 0 to disable */ =# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]); =# select syscache_usage_track_remove(2134); Or just provide a one-shot triggering function. =# select syscache_take_usage_track(<pid>); This can use either a similar PGPROC variable or SendProcSignal(), but the former doesn't fire during idle time unless a timer is used. Any thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote: > Hi, > > On 2019-01-18 19:57:03 -0500, Robert Haas wrote: > > On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote: > > > My proposal for this was to attach a 'generation' to cache entries. Upon > > > access cache entries are marked to be of the current > > > generation. Whenever existing memory isn't sufficient for further cache > > > entries and, on a less frequent schedule, triggered by a timer, the > > > cache generation is increased and th new generation's "creation time" is > > > measured. Then generations that are older than a certain threshold are > > > purged, and if there are any, the entries of the purged generation are > > > removed from the caches using a sequential scan through the cache. > > > > > > This outline achieves: > > > - no additional time measurements in hot code paths > > > - no need for a sequential scan of the entire cache when no generations > > > are too old > > > - both size and time limits can be implemented reasonably cheaply > > > - overhead when feature disabled should be close to zero > > > > Seems generally reasonable. The "whenever existing memory isn't > > sufficient for further cache entries" part I'm not sure about. > > Couldn't that trigger very frequently and prevent necessary cache size > > growth? > > I'm thinking it'd just trigger a new generation, with it's associated > "creation" time (which is cheap to acquire in comparison to creating a > number of cache entries) . Depending on settings or just code policy we > can decide up to which generation to prune the cache, using that > creation time. I'd imagine that we'd have some default cache-pruning > time in the minutes, and for workloads where relevant one can make > sizing configurations more aggressive - or something like that. OK, so it seems everyone likes the idea of a timer. The open questions are whether we want multiple epochs, and whether we want some kind of size trigger. 
With only one time epoch, if the timer is 10 minutes, you could expire an entry after 10-19 minutes, while with a new epoch every minute and a 10-minute expiry, you get 10-11 minute precision. I am not sure the complexity is worth it. For a size trigger, should removal be affected by how many expired cache entries there are? If there were 10k expired entries or 50, wouldn't you want them removed if they have not been accessed in X minutes? In the worst case, if 10k entries were accessed in a query and never accessed again, what would the ideal cleanup behavior be? Would it matter if it was expired in 10 or 19 minutes? Would it matter if there were only 50 entries? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > Although this doesn't put a hard cap on memory usage, it is indirectly and > softly limited by the cache_prune_min_age and cache_memory_target, which > determins how large a cache can grow until pruning happens. They are > per-cache basis. > > If we prefer to set a budget on all the syschaches (or even including other > caches), it would be more complex. > This is a pure question. How can we answer these questions from users? * What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100? * How much RAM do I need to have for the caches when I set cache_memory_target = 1M? The user tends to estimate memory to avoid OOM. Regards Takayuki Tsunakawa
At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp> > An option is an additional PGPROC member and interface functions. > > struct PGPROC > { > ... > int syscache_usage_track_interval; /* track interval, 0 to disable */ > > =# select syscache_usage_track_add(<pid>, <intvl>[, <repetition>]); > =# select syscache_usage_track_remove(2134); > > > Or just provide a one-shot triggering function. > > =# select syscache_take_usage_track(<pid>); > > This can use either a similar PGPROC variable or SendProcSignal(), > but the former doesn't fire during idle time unless a timer is used. The attached is a revised version of this patchset, where the third patch is the remote setting feature. It uses static shared memory. =# select pg_backend_catcache_stats(<pid>, <millis>); Activates or changes the catcache stats feature on the backend with the given PID. (The name should be changed to .._syscache_stats, though.) It is far smaller than the remote-GUC feature. (It contains a part that should be in the previous patch. I will fix it later.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 067f8ad60f259453271d2bf8323505beb5b9e0a9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, entries that haven't been used for a certain time are considered for removal before enlarging the hash array. 
--- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 166 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 254 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b6f5822b84..af3c52b868 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + certain amount of syscache entries with intermittent usage, try + increase this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. 
This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 18467d96d2..dbffec8067 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -733,7 +733,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 8152f7e21e..ee40093553 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -72,9 +72,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to cosider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be cosidered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -491,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -842,6 +858,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -859,9 +876,129 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initilize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with the similar algorithm with buffer + * eviction using access counter. Entries that are accessed several times can + * live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purpose */ + int ntotal = 0; + /* + * nth element in nentries stores the number of cache entries that have + * lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. 
+ */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that are not accessed after last pruning are removed in + * that seconds, and that has been accessed several times are + * removed after leaving alone for up to three times of the + * duration. We don't try shrink buckets since pruning effectively + * caps catcache expansion in the long term. 
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1275,6 +1412,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1820,11 +1962,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1843,13 +1987,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1877,8 +2022,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1899,17 +2044,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make a room for the new entry. 
If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c216ed0922..134c357bf3 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -80,6 +80,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2190,6 +2191,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index a21865a77f..d82af3bd6c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #dynamic_shared_memory_type = posix # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..5d24809900 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total amount of catcache tuples */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLPMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From f595a8a03f4438c52303c7fae3d95492550106b5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/3] Syscache usage tracking feature. Collects syscache usage statictics and show it using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 +++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 576 insertions(+), 45 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index af3c52b868..6dd024340b 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6662,6 +6662,21 @@ COPY 
postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval to collect system cache usage statistics in + milliseconds. This parameter is 0 by default, which means disabled. + Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index f4d9e9daf7..30e2da935a 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -904,6 +904,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1183,6 +1199,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT 
pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 13da412c59..2e8b7d0d91 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #include "utils/tqual.h" @@ -125,6 +126,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -237,6 +239,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only databsae stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both type of files */ + /* * Info about current "snapshot" of stats file */ @@ -336,6 +343,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_syscache_remove_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -631,10 +639,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * clean up stats files in specified directory. target is one of + * PGSTAT_REFILE_DBSTAT/SYSCACHE/ALL and restricts files to remove. 
*/ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -645,25 +656,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern. */ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -684,8 +709,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2962,6 +2988,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temprary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_syscache_remove_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after. 
We use a volatile pointer here to ensure the @@ -4286,6 +4316,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6376,3 +6409,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_syscache_remove_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ } + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled. remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_syscache_remove_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not yet the time, inform the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold off + * interrupts to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 0c0891b33e..e7972e645f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 053bb73863..0d32bf8daa 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1882,3 +1885,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index ee40093553..4a3b3094a0 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -90,6 +90,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -620,9 +624,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -698,9 +700,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -907,10 +907,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. */ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -924,7 +925,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This task is rather + * time-consuming during a catcache lookup, but acceptable since we are + * going to expand the hash table anyway. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -937,21 +942,21 @@ /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned.
+ * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. */ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -984,14 +989,17 @@ } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age classes must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1368,9 +1376,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1430,9 +1436,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1441,9 +1445,7 @@ SearchCatCacheInternal(CatCache *cache,
CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1571,9 +1573,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1684,9 +1684,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1743,9 +1741,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2253,3 +2249,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats and fills in most of the + * result. The classification here is based on the same criteria as + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed during a session, so fill it in every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have gone unaccessed for the corresponding ageclass multiple of + * cache_prune_min_age.
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of the specified syscache + * + * This routine returns the address of its local static memory.
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 7415c4faab..6b0fdbbd87 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -629,6 +630,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1240,6 +1243,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 134c357bf3..e8d7b6998a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3154,6 +3154,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index d82af3bd6c..4a6c9fceb5 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -554,6 +554,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 3ecc2e12c3..11fc1f3075 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname =>
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 313ca5f3c3..4d0f5b8042 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1134,6 +1134,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1218,7 +1219,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1353,5 +1355,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5d24809900..4d51975920 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for 
catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples that fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index e384cd2279..1991e75e97 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1919,6 +1919,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass
AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass, last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2350,7 +2372,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From 36f93fb3625e8f1753070d30ec81548a4dfe9eb1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 23 Jan 2019 17:32:03 +0900 Subject: [PATCH 3/3] Remote setting feature for catcache statistics.
--- src/backend/postmaster/pgstat.c | 26 ++++++--- src/backend/storage/ipc/ipci.c | 3 + src/backend/utils/cache/catcache.c | 116 +++++++++++++++++++++++++++++++++++++ src/include/catalog/pg_proc.dat | 8 +++ src/include/utils/catcache.h | 6 ++ 5 files changed, 151 insertions(+), 8 deletions(-) diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 2e8b7d0d91..338a407552 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -62,6 +62,7 @@ #include "storage/sinvaladt.h" #include "utils/ascii.h" #include "utils/guc.h" +#include "utils/catcache.h" #include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/rel.h" @@ -343,7 +344,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); -static void pgstat_syscache_remove_statsfile(void); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -2990,7 +2991,10 @@ pgstat_beshutdown_hook(int code, Datum arg) /* clear syscache statistics files and temporary settings */ if (MyBackendId != InvalidBackendId) - pgstat_syscache_remove_statsfile(); + { + pgstat_remove_syscache_statsfile(); + SetCatcacheStatsParam(0); + } /* * Clear my status entry, following the protocol of bumping st_changecount @@ -6432,7 +6436,7 @@ pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, /* removes syscache stats files of this backend */ static void -pgstat_syscache_remove_statsfile(void) +pgstat_remove_syscache_statsfile(void) { char fname[MAXPGPATH]; @@ -6461,15 +6465,21 @@ pgstat_write_syscache_stats(bool force) FILE *fpout; char statfile[MAXPGPATH]; char tmpfile[MAXPGPATH]; + int interval =
pgstat_track_syscache_usage_interval; + int interval_by_remote = GetCatcacheStatsParam(); + + /* remote setting overrides if any */ + if (interval_by_remote > 0) + interval = interval_by_remote; /* Return if we don't want it */ - if (!force && pgstat_track_syscache_usage_interval <= 0) + if (!force && interval <= 0) { /* disabled. remove the statistics file if any */ if (last_report > 0) { last_report = 0; - pgstat_syscache_remove_statsfile(); + pgstat_remove_syscache_statsfile(); } return 0; } @@ -6479,10 +6489,10 @@ pgstat_write_syscache_stats(bool force) TimestampDifference(last_report, now, &secs, &usecs); elapsed = secs * 1000 + usecs / 1000; - if (!force && elapsed < pgstat_track_syscache_usage_interval) + if (!force && elapsed < interval) { /* not yet the time, inform the remaining time to the caller */ - return pgstat_track_syscache_usage_interval - elapsed; + return interval - elapsed; } /* now update the stats */ @@ -6511,7 +6521,7 @@ pgstat_write_syscache_stats(bool force) * tell caller to wait for the next interval. 
*/ RESUME_INTERRUPTS(); - return pgstat_track_syscache_usage_interval; + return interval; } /* write out every catcache stats */ diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 2849e47d99..be5ee1f4ff 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -44,6 +44,7 @@ #include "storage/procsignal.h" #include "storage/sinvaladt.h" #include "storage/spin.h" +#include "utils/catcache.h" #include "utils/snapmgr.h" @@ -148,6 +149,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) size = add_size(size, BTreeShmemSize()); size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); + size = add_size(size, CatcacheStatsShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -267,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port) BTreeShmemInit(); SyncScanShmemInit(); AsyncShmemInit(); + CatcacheStatsShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 4a3b3094a0..7ff8cd22ca 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -22,14 +22,17 @@ #include "access/tuptoaster.h" #include "access/valid.h" #include "access/xact.h" +#include "catalog/pg_authid.h" #include "catalog/pg_collation.h" #include "catalog/pg_operator.h" #include "catalog/pg_type.h" #include "miscadmin.h" +#include "pgstat.h" #ifdef CATCACHE_STATS #include "storage/ipc.h" /* for on_proc_exit */ #endif #include "storage/lmgr.h" +#include "storage/procarray.h" #include "utils/builtins.h" #include "utils/datum.h" #include "utils/fmgroids.h" @@ -94,6 +97,17 @@ TimestampTz catcacheclock = 0; static double ageclass[SYSCACHE_STATS_NAGECLASSES] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; +/* remote commanding facility */ +typedef struct CatcacheStatsParam +{ + int interval; +} CatcacheStatsParam; + +#define NumCatcacheStatsParam (MaxBackends + 
NUM_AUXPROCTYPES) + +static slock_t CatcacheStatsParamLock; +static CatcacheStatsParam *CatcacheStatsParamArray = NULL; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -2310,3 +2324,105 @@ CatCacheGetStats(CatCache *cache, SysCacheStats *stats) } } } + +/* Report shared-memory space needed */ +Size +CatcacheStatsShmemSize(void) +{ + Size size; + + /* The same number of elements with backend status array */ + size = mul_size(sizeof(CatcacheStatsParam), NumCatcacheStatsParam); + + return size; +} + +/* Initialize the shared parameter array for catcache statistics */ +void +CatcacheStatsShmemInit(void) +{ + Size size; + bool found; + + size = CatcacheStatsShmemSize(); + CatcacheStatsParamArray = (CatcacheStatsParam *) + ShmemInitStruct("Backend Catcache Statistics Parameter Array", + size, &found); + + if (!found) + { + /* We're the first, initilize it */ + MemSet(CatcacheStatsParamArray, 0, size); + } + + SpinLockInit(&CatcacheStatsParamLock); +} + +/* + * SQL callable function to take catcache statistics of another backend + */ +Datum +backend_catcache_stats(PG_FUNCTION_ARGS) +{ + int target_pid = PG_GETARG_INT32(0); + int interval = PG_GETARG_INT32(1); + PGPROC *target_proc; + + if (interval < 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("interval must not be a negtive number"))); + + LWLockAcquire(ProcArrayLock, LW_SHARED); + target_proc = BackendPidGetProcWithLock(target_pid); + + if (target_proc == NULL) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("PID %d is not a PostgreSQL server process", + target_pid))); + + /* The same condition to pg_signal_backend() */ + if ((superuser_arg(target_proc->roleId) && !superuser()) || + (!has_privs_of_role(GetUserId(), target_proc->roleId) && + !has_privs_of_role(GetUserId(), DEFAULT_ROLE_SIGNAL_BACKENDID))) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("permission denied"))); + + if 
(target_proc->backendId == InvalidBackendId) + ereport(ERROR, + (errmsg("invalid backendid"))); + + SpinLockAcquire(&CatcacheStatsParamLock); + CatcacheStatsParamArray[target_proc->backendId - 1].interval = interval; + SpinLockRelease(&CatcacheStatsParamLock); + LWLockRelease(ProcArrayLock); + + PG_RETURN_VOID(); +} + +/* returns catcache stats paramter of this backend */ +int +GetCatcacheStatsParam(void) +{ + int interval; + + Assert(MyBackendId != InvalidBackendId); + + SpinLockAcquire(&CatcacheStatsParamLock); + interval = CatcacheStatsParamArray[MyBackendId - 1].interval; + SpinLockRelease(&CatcacheStatsParamLock); + + return interval; +} + +/* sets catcache stats paramter of this backend */ +void +SetCatcacheStatsParam(int interval) +{ + Assert(MyBackendId != InvalidBackendId); + SpinLockAcquire(&CatcacheStatsParamLock); + CatcacheStatsParamArray[MyBackendId - 1].interval = interval; + SpinLockRelease(&CatcacheStatsParamLock); +} diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 11fc1f3075..8011d94d5d 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -10518,4 +10518,12 @@ proargnames => '{rootrelid,relid,parentrelid,isleaf,level}', prosrc => 'pg_partition_tree' }, +# catcache statitics +{ oid => '3424', + descr => 'take backend statistics of another backend', + proname => 'pg_backend_catcache_stats', proisstrict => 'f', + proretset => 'f', provolatile => 'v', proparallel => 'u', + prorettype => 'void', proargtypes => 'int4 int4', + prosrc => 'backend_catcache_stats' }, + ] diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 4d51975920..69031f1a5e 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -24,6 +24,7 @@ #include "access/skey.h" #include "datatype/timestamp.h" #include "lib/ilist.h" +#include "utils/catcache.h" #include "utils/relcache.h" /* @@ -212,6 +213,9 @@ GetCatCacheClock(void) return catcacheclock; } +extern Size 
CatcacheStatsShmemSize(void); +extern void CatcacheStatsShmemInit(void); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, @@ -254,5 +258,7 @@ extern void PrintCatCacheListLeakWarning(CatCList *list); /* defined in syscache.h */ typedef struct syscachestats SysCacheStats; extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); +extern void SetCatcacheStatsParam(int interval); +extern int GetCatcacheStatsParam(void); #endif /* CATCACHE_H */ -- 2.16.3
On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote: > At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp> > > An option is an additional PGPROC member and interface functions. > > > > struct PGPROC > > { > > ... > > int syscahe_usage_track_interval; /* track interval, 0 to disable */ > > > > =# select syscahce_usage_track_add(<pid>, <intvl>[, <repetition>]); > > =# select syscahce_usage_track_remove(2134); > > > > > > Or, just provide an one-shot triggering function. > > > > =# select syscahce_take_usage_track(<pid>); > > > > This can use both a similar PGPROC variable or SendProcSignal() > > but the former doesn't fire while idle time unless using timer. > > The attached is revised version of this patchset, where the third > patch is the remote setting feature. It uses static shared memory. > > =# select pg_backend_catcache_stats(<pid>, <millis>); > > Activates or changes catcache stats feature on the backend with > PID. (The name should be changed to .._syscache_stats, though.) > It is far smaller than the remote-GUC feature. (It contains a > part that should be in the previous patch. I will fix it later.) I have a few questions to make sure we have not made the API too complex. First, for syscache_prune_min_age, that is the minimum age that we prune, and entries could last twice that long. Is there any value to doing the scan at 50% of the age so that the syscache_prune_min_age is the max age? For example, if our age cutoff is 10 minutes, we could scan every 5 minutes so 10 minutes would be the maximum age kept. Second, when would you use syscache_memory_target != 0? If you had syscache_prune_min_age really fast, e.g. 10 seconds? What is the use-case for this? 
You have a query that touches 10k objects, and then the connection stays active but doesn't touch many of those 10k objects, and you want it cleaned up in seconds instead of minutes? (I can't see why you would not clean up all unreferenced objects after _minutes_ of disuse, but removing them after seconds of disuse seems undesirable.) What are the odds you would retain the entries you want with a fast target? What is the value of being able to change a specific backend's stat interval? I don't remember any other setting having this ability. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Thank you for the comments. At Wed, 23 Jan 2019 18:21:45 -0500, Bruce Momjian <bruce@momjian.us> wrote in <20190123232145.GA8334@momjian.us> > On Wed, Jan 23, 2019 at 05:35:02PM +0900, Kyotaro HORIGUCHI wrote: > > At Mon, 21 Jan 2019 17:22:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20190121.172255.226467552.horiguchi.kyotaro@lab.ntt.co.jp> > > > An option is an additional PGPROC member and interface functions. > > > > > > struct PGPROC > > > { > > > ... > > > int syscahe_usage_track_interval; /* track interval, 0 to disable */ > > > > > > =# select syscahce_usage_track_add(<pid>, <intvl>[, <repetition>]); > > > =# select syscahce_usage_track_remove(2134); > > > > > > > > > Or, just provide an one-shot triggering function. > > > > > > =# select syscahce_take_usage_track(<pid>); > > > > > > This can use both a similar PGPROC variable or SendProcSignal() > > > but the former doesn't fire while idle time unless using timer. > > > > The attached is revised version of this patchset, where the third > > patch is the remote setting feature. It uses static shared memory. > > > > =# select pg_backend_catcache_stats(<pid>, <millis>); > > > > Activates or changes catcache stats feature on the backend with > > PID. (The name should be changed to .._syscache_stats, though.) > > It is far smaller than the remote-GUC feature. (It contains a > > part that should be in the previous patch. I will fix it later.) > > I have a few questions to make sure we have not made the API too > complex. First, for syscache_prune_min_age, that is the minimum age > that we prune, and entries could last twice that long. Is there any > value to doing the scan at 50% of the age so that the > syscache_prune_min_age is the max age? For example, if our age cutoff > is 10 minutes, we could scan every 5 minutes so 10 minutes would be the > maximum age kept. (Looking into the patch..) Actually thrice, not twice. 
It is because I put weight on access frequency: I think it is reasonable that more frequently accessed entries get a longer life (within a certain limit). The original problem here was negative caches that are created but never accessed. However, there's no firm reason for the number of steps (3); there might be no difference if the extra lifetime were just one more s_p_m_age, or even zero.

> Second, when would you use syscache_memory_target != 0?

It is a suggestion from upthread: we sometimes want to keep some known amount of cache entries even while expiration is active.

> If you had
> syscache_prune_min_age really fast, e.g. 10 seconds? What is the
> use-case for this? You have a query that touches 10k objects, and then
> the connection stays active but doesn't touch many of those 10k objects,
> and you want it cleaned up in seconds instead of minutes? (I can't see
> why you would not clean up all unreferenced objects after _minutes_ of
> disuse, but removing them after seconds of disuse seems undesirable.)
> What are the odds you would retain the entries you want with a fast
> target?

Are you asking about the unit? It's just because the value won't be very large even in seconds — 3600 at most. Even though I don't think such a short-duration setting is meaningful in the real world, I don't think we need to forbid it either. (It is actually useful for testing. :p) Another reason is that GUC_UNIT_MIN doesn't seem common: it is used by only two variables, log_rotation_age and old_snapshot_threshold.

> What is the value of being able to change a specific backend's stat
> interval? I don't remember any other setting having this ability.

As mentioned upthread, taking the statistics costs significant time, so I believe no one is willing to leave it turned on at all times. It would then be useless if it couldn't be switched on for an already-active backend at the moment it actually bloats.
So I wanted to provide a remote switching feature. I also thought there are other features that would be useful if they could be turned on remotely — hence the remote-GUC idea — but it was too complex... regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote:

> > Second, when would you use syscache_memory_target != 0?
>
> It is a suggestion upthread, we sometimes want to keep some known
> amount of caches despite that expration should be activated.
>
> > If you had
> > syscache_prune_min_age really fast, e.g. 10 seconds? What is the
> > use-case for this? You have a query that touches 10k objects, and then
> > the connection stays active but doesn't touch many of those 10k objects,
> > and you want it cleaned up in seconds instead of minutes? (I can't see
> > why you would not clean up all unreferenced objects after _minutes_ of
> > disuse, but removing them after seconds of disuse seems undesirable.)
> > What are the odds you would retain the entires you want with a fast
> > target?
>
> Do you asking the reason for the unit? It's just because it won't
> be so large even in seconds, to the utmost 3600 seconds. Even
> though I don't think such a short dutaion setting is meaningful
> in the real world, either I don't think we need to inhibit
> that. (Actually it is useful for testing:p) Another reason is

We have gone from ignoring the cache bloat problem to designing an API whose value even we can't articulate — and if we can't, we can be sure our users will not be able to either. Every GUC has a cost, even if it is not used. I suggest you go with just syscache_prune_min_age, get that into PG 12, and we can then reevaluate what we need. If you want to hard-code a minimum cache size where no pruning will happen, maybe based on the system catalogs or typical load, that is fine.

> that GUC_UNIT_MIN doesn't seem so common that it is used only by
> two variables, log_rotation_age and old_snapshot_threshold.
>
> > What is the value of being able to change a specific backend's stat
> > interval? I don't remember any other setting having this ability.
> > As mentioned upthread, it takes significant time to take
> statistics so I believe no one is willing to turn it on at all
> times. As the result it should be useless because it cannot be
> turned on on an active backend when it actually gets bloat. So I
> wanted to provide a remote switching feture.
>
> I also thought that there's some other features that is useful if
> it could be turned on remotely so the remote GUC feature but it
> was too complex...

Well, I am thinking that if we want to do something like this, we should do it for all GUCs, not just this one, so I suggest we not do this now either. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Bruce Momjian <bruce@momjian.us> writes: > On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote: >> I also thought that there's some other features that is useful if >> it could be turned on remotely so the remote GUC feature but it >> was too complex... > Well, I am thinking if we want to do something like this, we should do > it for all GUCs, not just for this one, so I suggest we not do this now > either. I will argue hard that we should not do it at all, ever. There is already a mechanism for broadcasting global GUC changes: apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP. I do not think we need something that can remotely change a GUC's value in just one session. The potential for bugs, misuse, and just plain confusion is enormous, and the advantage seems minimal. regards, tom lane
On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Thu, Jan 24, 2019 at 06:39:24PM +0900, Kyotaro HORIGUCHI wrote: > >> I also thought that there's some other features that is useful if > >> it could be turned on remotely so the remote GUC feature but it > >> was too complex... > > > Well, I am thinking if we want to do something like this, we should do > > it for all GUCs, not just for this one, so I suggest we not do this now > > either. > > I will argue hard that we should not do it at all, ever. > > There is already a mechanism for broadcasting global GUC changes: > apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP. > I do not think we need something that can remotely change a GUC's > value in just one session. The potential for bugs, misuse, and > just plain confusion is enormous, and the advantage seems minimal. I think there might be some merit in being able to activate debugging or tracing facilities for a particular session remotely, but designing something that will do that sort of thing well seems like a very complex problem that certainly should not be sandwiched into another patch that is mostly about something else. And if we ever get such a thing I suspect it should be entirely separate from the GUC system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Robert Haas [mailto:robertmhaas@gmail.com]

> On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I will argue hard that we should not do it at all, ever.
> >
> > There is already a mechanism for broadcasting global GUC changes:
> > apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
> > I do not think we need something that can remotely change a GUC's
> > value in just one session. The potential for bugs, misuse, and
> > just plain confusion is enormous, and the advantage seems minimal.
>
> I think there might be some merit in being able to activate debugging
> or tracing facilities for a particular session remotely, but designing
> something that will do that sort of thing well seems like a very
> complex problem that certainly should not be sandwiched into another
> patch that is mostly about something else. And if we ever get such a
> thing I suspect it should be entirely separate from the GUC system.

+1 for a separate patch for remote session configuration. ALTER SYSTEM + SIGHUP targeted at a particular backend would do if the DBA can log into the database server (so it can't be used for DBaaS). It would be useful to have pg_reload_conf(pid). Regards Takayuki Tsunakawa
Hi Horiguchi-san, Bruce,

From: Bruce Momjian [mailto:bruce@momjian.us]
> I suggest you go with just syscache_prune_min_age, get that into PG 12,
> and we can then reevaluate what we need. If you want to hard-code a
> minimum cache size where no pruning will happen, maybe based on the system
> catalogs or typical load, that is fine.

Please forgive me if I say something silly (I might have got lost.)

Are you suggesting to make the cache size limit system-defined and uncontrollable by the user? I think it's necessary for the DBA to be able to control the cache memory amount. Otherwise, if many concurrent connections access many partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM. How about the following questions I asked in my previous mail?

--------------------------------------------------
This is a pure question. How can we answer these questions from users?

* What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
* How much RAM do I need to have for the caches when I set cache_memory_target = 1M?

The user tends to estimate memory to avoid OOM.
--------------------------------------------------

Regards Takayuki Tsunakawa
On Fri, Jan 25, 2019 at 08:14:19AM +0000, Tsunakawa, Takayuki wrote: > Hi Horiguchi-san, Bruce, > > From: Bruce Momjian [mailto:bruce@momjian.us] > > I suggest you go with just syscache_prune_min_age, get that into > > PG 12, and we can then reevaluate what we need. If you want to > > hard-code a minimum cache size where no pruning will happen, maybe > > based on the system catalogs or typical load, that is fine. > > Please forgive me if I say something silly (I might have got lost.) > > Are you suggesting to make the cache size limit system-defined and > uncontrollable by the user? I think it's necessary for the DBA to > be able to control the cache memory amount. Otherwise, if many > concurrent connections access many partitions within a not-so-long > duration, then the cache eviction can't catch up and ends up in OOM. > How about the following questions I asked in my previous mail? > > ---------------------------------------------------------------------- > This is a pure question. How can we answer these questions from > users? > > * What value can I set to cache_memory_target when I can use 10 GB for > * the caches and max_connections = 100? How much RAM do I need to > * have for the caches when I set cache_memory_target = 1M? > > The user tends to estimate memory to avoid OOM. Well, let's walk through this. Suppose the default for syscache_prune_min_age is 10 minutes, and that we prune all cache entries unreferenced in the past 10 minutes, or we only prune every 10 minutes if the cache size is larger than some fixed size like 100. So, when would you change syscache_prune_min_age? If you reference many objects and then don't reference them at all for minutes, you might want to lower syscache_prune_min_age to maybe 1 minute. Why would you want to change the behavior of removing all unreferenced cache items, at least when there are more than 100? (You called this syscache_memory_target.) 
My point is I can see someone wanting to change syscache_prune_min_age, but I can't see someone wanting to change syscache_memory_target. Who would want to keep 5k cache entries that have not been accessed in X minutes? If we had some global resource manager that would allow you to control work_mem, maintenance_work_mem, cache size, and set global limits on their sizes, I can see where maybe it might make sense, but right now the memory usage of a backend is so fluid that setting some limit on its size for unreferenced entries just doesn't make sense. One of my big points is that syscache_memory_target doesn't even guarantee that the cache will be this size or lower, it only controls whether the cleanup happens at syscache_prune_min_age intervals. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
At Fri, 25 Jan 2019 08:14:19 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB70EFB@G01JPEXMBYT05>

> Hi Horiguchi-san, Bruce,
>
> From: Bruce Momjian [mailto:bruce@momjian.us]
> > I suggest you go with just syscache_prune_min_age, get that into PG 12,
> > and we can then reevaluate what we need. If you want to hard-code a
> > minimum cache size where no pruning will happen, maybe based on the system
> > catalogs or typical load, that is fine.
>
> Please forgive me if I say something silly (I might have got lost.)
>
> Are you suggesting to make the cache size limit system-defined and uncontrollable by the user? I think it's necessary for the DBA to be able to control the cache memory amount. Otherwise, if many concurrent connections access many partitions within a not-so-long duration, then the cache eviction can't catch up and ends up in OOM. How about the following questions I asked in my previous mail?

cache_memory_target does the opposite of limiting memory usage: it keeps some amount of syscache entries unpruned. It is intended for sessions where cache-dependent queries run intermittently. syscache_prune_min_age doesn't directly limit the size either; it just eventually prevents unbounded memory consumption. The knobs are not hard to reason about, and they don't need tuning in most cases.

> --------------------------------------------------
> This is a pure question. How can we answer these questions from users?
>
> * What value can I set to cache_memory_target when I can use 10 GB for the caches and max_connections = 100?
> * How much RAM do I need to have for the caches when I set cache_memory_target = 1M?
>
> The user tends to estimate memory to avoid OOM.
> --------------------------------------------------

You don't have direct control over syscache memory usage.
When you find a query slowed by the default cache expiration, you can set cache_memory_target to keep the entries around for intermittent executions of that query, or you can increase syscache_prune_min_age to let cache entries live longer. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 25 Jan 2019 07:26:46 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB70E6B@G01JPEXMBYT05>

> From: Robert Haas [mailto:robertmhaas@gmail.com]
> > On Thu, Jan 24, 2019 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > I will argue hard that we should not do it at all, ever.
> > >
> > > There is already a mechanism for broadcasting global GUC changes:
> > > apply them to postgresql.conf (or use ALTER SYSTEM) and SIGHUP.
> > > I do not think we need something that can remotely change a GUC's
> > > value in just one session. The potential for bugs, misuse, and
> > > just plain confusion is enormous, and the advantage seems minimal.
> >
> > I think there might be some merit in being able to activate debugging
> > or tracing facilities for a particular session remotely, but designing
> > something that will do that sort of thing well seems like a very
> > complex problem that certainly should not be sandwiched into another
> > patch that is mostly about something else. And if we ever get such a
> > thing I suspect it should be entirely separate from the GUC system.

That would mean we have a lesser copy of the GUC system that can be set remotely, where some features explicitly register their own knobs on the new system, presumably under the same names as the related GUCs (for users' convenience).

> +1 for a separate patch for remote session configuration.

That sounds reasonable to me. As I said, there should be other such variables.

> ALTER SYSTEM + SIGHUP targeted at a particular backend would do
> if the DBA can log into the database server (so, it can't be
> used for DBaaS.) It would be useful to have
> pg_reload_conf(pid).

I don't think that is workable, though. ALTER SYSTEM alters a *system* configuration, which is assumed to be the same across all sessions and other processes.
Worse, every session would start syscache tracking if an ALTER SYSTEM on some other variable followed by pg_reload_conf() came after doing the above. I think the change should persist no longer than the session lifetime. I think a consensus has formed here on backend-targeted remote tuning. :)

A. Make GUC variables settable by a remote session.
   A-1. Variables are changed while the session is busy (my first patch).
        (The transaction-awareness of GUCs makes this complex.)
   A-2. Variables are changed when the session is idle (or outside a transaction).

B. Override some variables via values placed in shared memory (my second, i.e. the latest, patch). Very specific to one target feature, and I think it consumes a bit too much memory.

C. Provide session-specific GUC variables (that override the global ones):
   - Add a new configuration file "postgresql.conf.<PID>"; pg_reload_conf() makes the session with that PID load it as if it were the last included file. All such files are removed at startup or at the end of the corresponding session.
   - Add a new syntax like this:
       ALTER SESSION WITH (pid=xxxx)
         SET configuration_parameter {TO | =} {value | 'value' | DEFAULT}
         RESET configuration_parameter
         RESET ALL
   - Target variables are marked with GUC_REMOTE.

I'll consider the last choice and will come up with a patch.

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi >> > I suggest you go with just syscache_prune_min_age, get that into PG >> > 12, and we can then reevaluate what we need. If you want to >> > hard-code a minimum cache size where no pruning will happen, maybe >> > based on the system catalogs or typical load, that is fine. >> >> Please forgive me if I say something silly (I might have got lost.) >> >> Are you suggesting to make the cache size limit system-defined and uncontrollable >by the user? I think it's necessary for the DBA to be able to control the cache memory >amount. Otherwise, if many concurrent connections access many partitions within a >not-so-long duration, then the cache eviction can't catch up and ends up in OOM. >How about the following questions I asked in my previous mail? > >cache_memory_target does the opposit of limiting memory usage. It keeps some >amount of syscahe entries unpruned. It is intended for sessions on where >cache-effective queries runs intermittently. >syscache_prune_min_age also doesn't directly limit the size. It just eventually >prevents infinite memory consumption. > >The knobs are not no-brainer at all and don't need tuning in most cases. > >> -------------------------------------------------- >> This is a pure question. How can we answer these questions from users? >> >> * What value can I set to cache_memory_target when I can use 10 GB for the >caches and max_connections = 100? >> * How much RAM do I need to have for the caches when I set cache_memory_target >= 1M? >> >> The user tends to estimate memory to avoid OOM. >> -------------------------------------------------- > >You don't have a direct control on syscache memory usage. When you find a queriy >slowed by the default cache expiration, you can set cache_memory_taret to keep >them for intermittent execution of a query, or you can increase >syscache_prune_min_age to allow cache live for a longer time. > In current ver8 patch there is a stats view representing age class distribution. 
https://www.postgresql.org/message-id/20181019.173457.68080786.horiguchi.kyotaro%40lab.ntt.co.jp Does it help DBAs with tuning cache_prune_age and/or cache_prune_target? If the number of cache entries in the older age classes is large, are people supposed to lower prune_age rather than change cache_prune_target? (I'm a little confused.) Regards, Takeshi Ideriha
At Wed, 30 Jan 2019 05:06:30 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F4156D4@G01JPEXMBKW04>

> >You don't have direct control over syscache memory usage. When you find a query
> >slowed by the default cache expiration, you can set cache_memory_target to keep
> >the entries around for intermittent executions of that query, or you can increase
> >syscache_prune_min_age to let cache entries live longer.
>
> In the current v8 patch there is a stats view representing the age class distribution.
> https://www.postgresql.org/message-id/20181019.173457.68080786.horiguchi.kyotaro%40lab.ntt.co.jp
> Does it help DBAs with tuning cache_prune_age and/or cache_prune_target?

Definitely — right now the DBA can see nothing at all about cache usage.

> If the number of cache entries in the older age classes is large, are people supposed to lower prune_age
> rather than change cache_prune_target?
> (I'm a little confused.)

This feature just removes cache entries that have not been accessed for a certain time. If older entries occupy the major portion, the syscache is being used effectively (in other words, most entries are accessed frequently enough), and in that case I believe the syscache doesn't put pressure on memory usage. If total memory usage still exceeds expectations in that case, reducing the pruning age may reduce it, but not necessarily; an extremely short pruning age will work, in exchange for performance degradation.

If newer entries occupy the major portion, the syscache may not be used effectively. The total memory usage will then be limited by the pruning feature, so no tuning should be needed.

In both cases, if pruning slows down intermittent large queries, cache_memory_target will alleviate the slowdown.

regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Jan 28, 2019 at 01:31:43PM +0900, Kyotaro HORIGUCHI wrote: > I'll consider the last choice and will come up with a patch. Update is recent, so I have just moved the patch to next CF. -- Michael
Attachment
Horiguchi-san, Bruce, Thank you for telling me your ideas behind this feature. Frankly, I'm not convinced the proposed specification is OK, but I can't explain it well at this instant. So, let me discuss that in a subsequent mail. Anyway, here are my review comments on 0001:

(1) +/* GUC variable to define the minimum age of entries that will be cosidered to + /* initilize catcache reference clock if haven't done yet */ cosidered -> considered initilize -> initialize I remember I saw some other wrong spellings and/or missing words, which I forgot (sorry).

(2) Only the doc prefixes "sys" to the new parameter names. Other places don't have it. I think we should prefix "sys", because relcache and plancache should be configurable separately due to their different usage patterns/lifecycles.

(3) The doc doesn't describe the unit of syscache_memory_target. Kilobytes?

(4) + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + tupsize = sizeof(CatCTup); GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do.

(5) + if (entry_age > cache_prune_min_age) ">=" instead of ">"?

(6) + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); It's better to write "ct->c_list == NULL" to follow the style in this file. "ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which should rarely happen.

(7) CatalogCacheCreateEntry + int tupsize = 0; if (ntp) { int i; + int tupsize; tupsize is defined twice.

(8) CatalogCacheCreateEntry In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible.

(9) The memory for CatCList is not taken into account for syscache_memory_target.

Regards Takayuki Tsunakawa
Horiguchi-san, Bruce, all, I hesitate to say this, but I think there are the following problems with the proposed approach:

1) It tries to prune the catalog tuples only when the hash table is about to expand. If no tuple is found to be eligible for eviction at first and the hash table expands, it gets difficult for unnecessary or less frequently accessed tuples to be removed, because the interval until the next hash table expansion gets longer and longer. The hash table doubles in size each time. For example, if many transactions are executed in a short duration that create and drop temporary tables and indexes, the hash table could become large quickly.

2) syscache_prune_min_age is difficult to set to meet contradictory requirements. e.g., in the above temporary objects case, the user wants to shorten syscache_prune_min_age so that the catalog tuples for temporary objects are removed. But that is also likely to result in the necessary catalog tuples for non-temporary objects being removed.

3) The DBA cannot control the memory usage. It's not predictable. syscache_memory_target doesn't set a limit on memory usage, despite the impression its name gives. In general, the cache should be able to set an upper limit on its size so that the DBA can manage things within a given amount of memory. I think other PostgreSQL parameters are based on that idea -- shared_buffers, wal_buffers, work_mem, temp_buffers, etc.

4) The memory usage doesn't decrease once allocated. The normal allocation memory context, aset.c, which CacheMemoryContext uses, doesn't return pfree()d memory to the operating system. Once CacheMemoryContext becomes big, it won't get smaller.

5) Catcaches are managed independently of each other. Even if there are many unnecessary catalog tuples in one catcache, they are not freed to make room for other catcaches.

So, why don't we make syscache_memory_target the upper limit on the total size of all catcaches, and rethink the past LRU management?

Regards Takayuki Tsunakawa
On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote: > Horiguchi-san, Bruce, all, So, why don't we make > syscache_memory_target the upper limit on the total size of all > catcaches, and rethink the past LRU management? I was going to say that our experience with LRU has been that the overhead is not worth the value, but that was in shared resource cases, which this is not. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
From: bruce@momjian.us [mailto:bruce@momjian.us] > On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote: > > Horiguchi-san, Bruce, all, So, why don't we make > > syscache_memory_target the upper limit on the total size of all > > catcaches, and rethink the past LRU management? > > I was going to say that our experience with LRU has been that the > overhead is not worth the value, but that was in shared resource cases, > which this is not. That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san. Regards Takayuki Tsunakawa
>From: bruce@momjian.us [mailto:bruce@momjian.us] >On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote: >> Horiguchi-san, Bruce, all, So, why don't we make >> syscache_memory_target the upper limit on the total size of all >> catcaches, and rethink the past LRU management? > >I was going to say that our experience with LRU has been that the overhead is not >worth the value, but that was in shared resource cases, which this is not. One idea is to build a list with an access counter to implement an LRU list on top of the current patch. The list is ordered by last access time. When a catcache entry is referenced, the list is maintained, which is just a few pointer manipulations. As Bruce mentioned, it's not shared, so there is no cost related to lock contention. When it comes to pruning, entries older than a certain timestamp with a zero access counter are pruned. This way would improve performance because it only scans a limited range (bounded by sys_cache_min_age). The current patch scans all hash entries and checks each timestamp, which would decrease performance as the cache size grows. I'm hoping to implement this idea and measure the performance. And when we want to set a memory size limit, as Tsunakawa-san said, the LRU list would be suitable. Regards, Takeshi Ideriha
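The LRU-list idea above -- a doubly linked list ordered by last access, moved-to-head on every reference, pruned from the tail -- can be sketched as a standalone toy model. The names (`CacheEntry`, `LruList`, `lru_touch`, `lru_prune_older_than`) are invented for illustration; a real implementation would use PostgreSQL's dlist primitives and hang off CatCTup:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Toy catcache entry: only the fields the LRU logic needs. */
typedef struct CacheEntry
{
    struct CacheEntry *prev;
    struct CacheEntry *next;
    time_t             lastaccess;   /* timestamp of the last reference */
    int                naccess;      /* capped access counter, as in the patch */
} CacheEntry;

typedef struct LruList
{
    CacheEntry *head;                /* most recently used */
    CacheEntry *tail;                /* least recently used */
} LruList;

/* Unlink an entry; a no-op if the entry is not currently linked. */
static void
lru_unlink(LruList *lru, CacheEntry *e)
{
    if (lru->head != e && e->prev == NULL && e->next == NULL)
        return;                      /* not in the list yet */
    if (e->prev)
        e->prev->next = e->next;
    else
        lru->head = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        lru->tail = e->prev;
    e->prev = e->next = NULL;
}

/* On every cache reference: move to head and stamp the access time. */
static void
lru_touch(LruList *lru, CacheEntry *e, time_t now)
{
    lru_unlink(lru, e);
    e->prev = NULL;
    e->next = lru->head;
    if (lru->head)
        lru->head->prev = e;
    lru->head = e;
    if (lru->tail == NULL)
        lru->tail = e;
    e->lastaccess = now;
    if (e->naccess < 2)
        e->naccess++;
}

/*
 * Prune from the tail: stop at the first entry younger than min_age
 * seconds, so only expired entries are visited -- no full-cache scan.
 */
static int
lru_prune_older_than(LruList *lru, time_t now, int min_age)
{
    int nremoved = 0;

    while (lru->tail && now - lru->tail->lastaccess > min_age)
    {
        CacheEntry *victim = lru->tail;

        lru_unlink(lru, victim);     /* real code would free the entry here */
        nremoved++;
    }
    return nremoved;
}
```

The key property is the one claimed in the mail: pruning cost is proportional to the number of expired entries, not the cache size, at the price of two extra pointers per entry and a couple of pointer writes per cache hit.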
Hi, I find it a bit surprising that there are almost no results demonstrating the impact of the proposed changes on some typical workloads. It touches code (syscache, ...) that is quite sensitive performance-wise, and adding even just a little bit of overhead may hurt significantly, even on systems that don't have issues with cache bloat, etc. I think this is something we need - benchmarks measuring the overhead on a bunch of workloads (both typical and corner cases). Especially when there was a limit on cache size in the past, and it was removed because it was too expensive / hurting in some cases. I can't imagine committing any such changes without this information. This is particularly important as the patch was about one particular issue (bloat due to negative entries) initially, but then the scope grew quite a bit. AFAICS the thread now talks about these workloads:

* negative entries (due to search_path lookups etc.)
* many tables accessed randomly
* many tables with only a small subset accessed frequently
* many tables with subsets accessed in subsets (due to pooling)
* ...

Unfortunately, some of those cases seem somewhat contradictory (i.e. what works for one hurts the other), so I doubt it's possible to improve all of them at once. But that makes the benchmarking even more important. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/21/19 9:56 PM, Bruce Momjian wrote: > On Fri, Jan 18, 2019 at 05:09:41PM -0800, Andres Freund wrote: >> Hi, >> >> On 2019-01-18 19:57:03 -0500, Robert Haas wrote: >>> On Fri, Jan 18, 2019 at 4:23 PM andres@anarazel.de <andres@anarazel.de> wrote: >>>> My proposal for this was to attach a 'generation' to cache entries. Upon >>>> access cache entries are marked to be of the current >>>> generation. Whenever existing memory isn't sufficient for further cache >>>> entries and, on a less frequent schedule, triggered by a timer, the >>>> cache generation is increased and the new generation's "creation time" is >>>> measured. Then generations that are older than a certain threshold are >>>> purged, and if there are any, the entries of the purged generation are >>>> removed from the caches using a sequential scan through the cache. >>>> >>>> This outline achieves: >>>> - no additional time measurements in hot code paths >>>> - no need for a sequential scan of the entire cache when no generations >>>> are too old >>>> - both size and time limits can be implemented reasonably cheaply >>>> - overhead when feature disabled should be close to zero >>> >>> Seems generally reasonable. The "whenever existing memory isn't >>> sufficient for further cache entries" part I'm not sure about. >>> Couldn't that trigger very frequently and prevent necessary cache size >>> growth? >> >> I'm thinking it'd just trigger a new generation, with its associated >> "creation" time (which is cheap to acquire in comparison to creating a >> number of cache entries). Depending on settings or just code policy we >> can decide up to which generation to prune the cache, using that >> creation time. I'd imagine that we'd have some default cache-pruning >> time in the minutes, and for workloads where relevant one can make >> sizing configurations more aggressive - or something like that. > > OK, so it seems everyone likes the idea of a timer. 
> The open questions > are whether we want multiple epochs, and whether we want some kind of > size trigger. > FWIW I share the view that time-based eviction (be it some sort of timestamp or epoch) seems promising; it seems cheaper than pretty much any other LRU metric (requiring usage count / clock sweep / ...). > With only one time epoch, if the timer is 10 minutes, you could expire an > entry after 10-19 minutes, while with a new epoch every minute and > 10-minute expire, you can do 10-11 minute precision. I am not sure the > complexity is worth it. > I don't think having just a single epoch would be significantly less complex than having more of them. In fact, having more of them might make it actually cheaper. > For a size trigger, should removal be effected by how many expired cache > entries there are? If there were 10k expired entries or 50, wouldn't > you want them removed if they have not been accessed in X minutes? > > In the worst case, if 10k entries were accessed in a query and never > accessed again, what would the ideal cleanup behavior be? Would it > matter if it was expired in 10 or 19 minutes? Would it matter if there > were only 50 entries? > I don't think we need to remove the expired entries right away, if there are only very few of them. The cleanup requires walking the hash table, which means significant fixed cost. So if there are only a few expired entries (say, less than 25% of the cache), we can just leave them around and clean them if we happen to stumble on them (although that may not be possible with dynahash, which has no concept of expiration) or before enlarging the hash table. FWIW when it comes to memory consumption, it's important to realize the cache memory context won't release the memory to the system, even if we remove the expired entries. It'll simply stash them into a freelist. 
That's OK when the entries are to be reused, but the memory usage won't decrease after a sudden spike for example (and there may be other chunks allocated on the same page, so paging it out will hurt). So if we want to address this case too (and we probably want to), we may need to discard the old cache memory context somehow (e.g. rebuild the cache in a new one, and copy the non-expired entries). Which is a nice opportunity to do the "full" cleanup, of course. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
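Andres's generation outline quoted above can be modeled very compactly: each entry records the generation in which it was last touched, a timer tick opens a new generation, and a table scan happens only on the purge path. This is a toy standalone sketch with invented names (`gen_touch`, `gen_tick`, `gen_purge`) and a fixed-size table, not the actual catcache design:

```c
#include <assert.h>
#include <stddef.h>

#define MAXENT 8

typedef struct
{
    int generation;      /* generation of the last access */
    int live;            /* still present in the cache? */
} Entry;

static Entry entries[MAXENT];
static int   cur_gen = 0;

/* Accessing an entry just stamps the current generation -- no clock call
 * in the hot path, which is the point of the scheme. */
static void
gen_touch(int i)
{
    entries[i].generation = cur_gen;
}

/* Timer tick: open a new generation (its creation time would be recorded
 * in the real design, so age thresholds can be expressed in seconds). */
static void
gen_tick(void)
{
    cur_gen++;
}

/*
 * Purge entries whose last-access generation is at least min_gens ticks
 * old.  Only this path scans the table, and a caller would invoke it only
 * when generations old enough to purge actually exist.
 */
static int
gen_purge(int min_gens)
{
    int nremoved = 0;

    for (int i = 0; i < MAXENT; i++)
    {
        if (entries[i].live && cur_gen - entries[i].generation >= min_gens)
        {
            entries[i].live = 0;
            nremoved++;
        }
    }
    return nremoved;
}
```

Compared with stamping a timestamp on every access, this trades precision (an entry's age is known only to the granularity of a tick) for a hot path that touches a single integer.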
On 2019-Feb-05, Tomas Vondra wrote: > I don't think we need to remove the expired entries right away, if there > are only very few of them. The cleanup requires walking the hash table, > which means significant fixed cost. So if there are only a few expired > entries (say, less than 25% of the cache), we can just leave them around > and clean them if we happen to stumble on them (although that may not be > possible with dynahash, which has no concept of expiration) or before > enlarging the hash table. I think seqscanning the hash table is going to be too slow; Ideriha-san's idea of having a dlist with the entries in LRU order (where each entry is moved to the head of the list when it is touched) seemed good: it allows you to evict older ones when the time comes, without having to scan the rest of the entries. Having a dlist means two more pointers on each cache entry AFAIR, so it's not a huge amount of memory. > So if we want to address this case too (and we probably want to), we may > need to discard the old cache memory context somehow (e.g. rebuild the > cache in a new one, and copy the non-expired entries). Which is a nice > opportunity to do the "full" cleanup, of course. Yeah, we probably don't want to do this super frequently though. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/5/19 11:05 PM, Alvaro Herrera wrote: > On 2019-Feb-05, Tomas Vondra wrote: > >> I don't think we need to remove the expired entries right away, if there >> are only very few of them. The cleanup requires walking the hash table, >> which means significant fixed cost. So if there are only a few expired >> entries (say, less than 25% of the cache), we can just leave them around >> and clean them if we happen to stumble on them (although that may not be >> possible with dynahash, which has no concept of expiration) or before >> enlarging the hash table. > > I think seqscanning the hash table is going to be too slow; Ideriha-san's > idea of having a dlist with the entries in LRU order (where each entry > is moved to the head of the list when it is touched) seemed good: it allows you > to evict older ones when the time comes, without having to scan the rest > of the entries. Having a dlist means two more pointers on each cache > entry AFAIR, so it's not a huge amount of memory. > Possibly, although my guess is it will depend on the number of entries to remove. For a small number of entries, the dlist approach is going to be faster, but at some point the bulk seqscan gets more efficient. FWIW this is exactly where a bit of benchmarking would help. >> So if we want to address this case too (and we probably want to), we may >> need to discard the old cache memory context somehow (e.g. rebuild the >> cache in a new one, and copy the non-expired entries). Which is a nice >> opportunity to do the "full" cleanup, of course. > > Yeah, we probably don't want to do this super frequently though. > Right. I've also realized the resizing is built into dynahash and is kinda incremental - we add (and split) buckets one by one, instead of immediately rebuilding the whole hash table. So yes, this would need more care and might need to interact with dynahash in some way. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05> > From: bruce@momjian.us [mailto:bruce@momjian.us] > > On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote: > > > Horiguchi-san, Bruce, all, So, why don't we make > > > syscache_memory_target the upper limit on the total size of all > > > catcaches, and rethink the past LRU management? > > > > I was going to say that our experience with LRU has been that the > > overhead is not worth the value, but that was in shared resource cases, > > which this is not. > > That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san. If by "LRU" you mean an accessed-time-ordered list of entries, I still object to involving it, since it adds too much complexity to the search code paths. Invalidation would make things more complex. The current patch sorts entries by ct->lastaccess and discards entries not accessed for more than the threshold, only when doubling the cache capacity. It is already a kind of LRU in behavior. This patch intends not to let caches bloat with unnecessary entries, which were negative entries at first, and now also less-accessed ones. If by "LRU" you mean something to put a hard limit on the number or size of a catcache or all caches, it would be doable by adding a sort phase before pruning, like CatCacheCleanOldEntriesByNum() in the attached as a PoC (first attached) as food for discussion. With the second attached script, we can observe what is happening from another session by the following query. select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass; > pg_statistic | 1041024 | 7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0 On the other hand, differently from the original pruning, this happens independently of hash resizing, so it will cause another kind of observable intermittent slowdown besides rehashing. 
The two should have the same extent of impact on performance when disabled. I'll take numbers briefly using pgbench. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 21f7b5528be03274dae9e58690c35cee9e68c82f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. --- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 166 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 254 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 9b7a7388d5..d0d2374944 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + certain amount of syscache entries with intermittent usage, try + increase this setting. 
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 92bda87804..ddc433c59e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -734,7 +734,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..2a996d740a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to cosider entry eviction. + * This variable is shared among various cache mechanisms. 
+ */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be cosidered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,6 +857,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -858,9 +875,129 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initilize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with the similar algorithm with buffer + * eviction using access counter. Entries that are accessed several times can + * live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purpose */ + int ntotal = 0; + /* + * nth element in nentries stores the number of cache entries that have + * lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age. 
The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2. + */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that are not accessed after last pruning are removed in + * that seconds, and that has been accessed several times are + * removed after leaving alone for up to three times of the + * duration. We don't try shrink buckets since pruning effectively + * caps catcache expansion in the long term. 
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else + { + if (!ct->c_list || ct->c_list->refcount == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1411,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1819,11 +1961,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1842,13 +1986,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2021,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1898,17 +2043,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make a room for the new entry. 
If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8681ada33a..06c589f725 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that live unused for longer than this seconds are considered to be removed."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c7f53470df..108d332f2c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..5d24809900 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total amount of catcache tuples */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLPMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 9f243e2fa6c6aaa5e333662f63c28c18ea72ed0f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/4] Syscache usage tracking feature. Collects syscache usage statictics and show it using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 +++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 576 insertions(+), 45 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d0d2374944..5ff3ebeb4e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6687,6 +6687,21 @@ COPY 
postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which system cache usage + statistics are collected. This parameter is 0 by default, which means + disabled. Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT 
pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..a1939958b7 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * clean up stats files in specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts files to remove. 
*/ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern. */ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after. 
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) + { + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics file. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * writes a file regardless of the remaining time and resets the interval. 
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled. remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not yet time; inform the caller of the remaining time */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold interrupts + * to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval. 
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..fb77a0ce4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..6526cfefb4 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 2a996d740a..4ccda06795 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. */ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is quite a + * time-consuming task during a catcache lookup, but acceptable since we + * are now going to expand the hash table. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp) /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. 
+ * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. */ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -983,14 +988,17 @@ } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age classes must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d, 1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1367,9 +1375,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1429,9 +1435,7 @@ SearchCatCacheInternal(CatCache *cache, 
CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1570,9 +1572,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1683,9 +1683,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1742,9 +1740,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2252,3 +2248,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats; it fills in most of the + * result. The classification here is based on the same criteria as + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed during a session, so fill it in + * every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding ageclass multiple of + * cache_prune_min_age. 
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. 
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index c0b6231458..dee7f19475 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 06c589f725..32e41253a6 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collections, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 108d332f2c..4d4fb42251 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -560,6 +560,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index b8de13f03b..6099a828d2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5d24809900..4d51975920 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for 
catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples falling into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass 
AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From 5be729e44acf3f9c94dd9d13fa84cb4ae598406f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 6 Feb 2019 14:36:29 +0900 Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature Adds prune based on the number of cache entries on top of the current pruning patch. It is controlled by two GUC variables. 
cache_entry_limit: limit of the number of entries per catcache cache_entry_limit_prune_ratio: the fraction of cache_entry_limit to prune down to --- src/backend/utils/cache/catcache.c | 100 ++++++++++++++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 40 +++++++++++++++ src/include/utils/catcache.h | 2 + 3 files changed, 141 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 4ccda06795..d15eac87d8 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -77,6 +77,11 @@ */ int cache_memory_target = 0; + +/* PoC entry limit */ +int cache_entry_limit = 0; +double cache_entry_limit_prune_ratio = 0.8; + /* GUC variable to define the minimum age of entries that will be considered to * be evicted in seconds. This variable is shared among various cache * mechanisms. @@ -882,6 +887,95 @@ InitCatCache(int id, return cp; } +/* + * CatCacheCleanupOldEntriesByNum - + * PoC: remove infrequently-used entries by number of entries. + */ +static bool +CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit) +{ + int i; + int n; + int oldndelelem = cp->cc_ntup; + int ndelelem; + CatCTup **ct_array; + + ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio); + + /* lower limit: quite arbitrary */ + if (ndelelem < 256) + ndelelem = 256; + + /* + * partial sort array: [0] contains the latest access entry + * [1] contains the earliest access entry + */ + ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*)); + n = 0; + + /* + * Collect entries to be removed, which have older lastaccess. + * Using a bounded heap sort, as in tuplesort.c.
+ */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + if (n < ndelelem) + { + int j = n++; + + while (j > 0) + { + int i = (j - 1) >> 1; + + if (ct->lastaccess <= ct_array[i]->lastaccess) + break; + ct_array[j] = ct_array[i]; + j = i; + } + ct_array[j] = ct; + } + else if (ct->lastaccess < ct_array[0]->lastaccess) + { + unsigned int i; + + i = 0; + + for (;;) + { + unsigned int j = 2 * i + 1; + + if (j >= n) + break; + if (j + 1 < n && + ct_array[j]->lastaccess < ct_array[j + 1]->lastaccess) + j++; + if (ct->lastaccess >= ct_array[j]->lastaccess) + break; + ct_array[i] = ct_array[j]; + i = j; + } + ct_array[i] = ct; + } + } + } + + /* Now remove the collected elements (n may be less than ndelelem) */ + for (i = 0 ; i < n ; i++) + CatCacheRemoveCTup(cp, ct_array[i]); + + pfree(ct_array); + + elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup); + + return true; +} + /* * CatCacheCleanupOldEntries - Remove infrequently-used entries * @@ -923,7 +1017,7 @@ CatCacheCleanupOldEntries(CatCache *cp) hash_size = cp->cc_nbuckets * sizeof(dlist_head); if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - + /* * Search the whole hash for entries to remove. This is quite a time- * consuming task during catcache lookup, but acceptable since now we are @@ -2049,6 +2143,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; cache->cc_tupsize += tupsize; + /* cap number of entries */ + if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit) + CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit); + /* * If the hash table has become too full, try cleanup by removing * infrequently used entries to make room for the new entry.
If it diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 32e41253a6..7bb239a07e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2227,6 +2227,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of entries per catcache."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if @@ -3401,6 +3411,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the fraction of cache_entry_limit to prune catcaches down to."), + NULL + }, + &cache_entry_limit_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 4d51975920..1f7fb51ac0 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext; /* for guc.c, not PGDLLIMPORT'ed */ extern int cache_prune_min_age; extern int cache_memory_target; +extern int cache_entry_limit; +extern double cache_entry_limit_prune_ratio; /* to use as access timestamp of catcache entries */ extern TimestampTz catcacheclock; -- 2.16.3
At Wed, 06 Feb 2019 14:43:34 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.144334.193118280.horiguchi.kyotaro@lab.ntt.co.jp> > At Tue, 5 Feb 2019 02:40:35 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB93A16@G01JPEXMBYT05> > > From: bruce@momjian.us [mailto:bruce@momjian.us] > > > On Mon, Feb 4, 2019 at 08:23:39AM +0000, Tsunakawa, Takayuki wrote: > > > > Horiguchi-san, Bruce, all, So, why don't we make > > > > syscache_memory_target the upper limit on the total size of all > > > > catcaches, and rethink the past LRU management? > > > > > > I was going to say that our experience with LRU has been that the > > > overhead is not worth the value, but that was in shared resource cases, > > > which this is not. > > > > That's good news! Then, let's proceed with the approach involving LRU, Horiguchi-san, Ideriha-san. > > If you mean an accessed-time-ordered list of entries by "LRU", I > still object to involving it since it is too complex in the searching > code paths. Invalidation would make things more complex. The > current patch sorts entries by ct->lastaccess and discards > entries not accessed for more than the threshold, only at doubling of > cache capacity. It is already a kind of LRU in behavior. > > This patch intends not to let caches bloat with unnecessary > entries, which is negative ones at first, then less-accessed ones > currently. If you mean by "LRU" something to put a hard limit on > the number or size of a catcache or all caches, it would be > doable by adding a sort phase before pruning, like > CatCacheCleanupOldEntriesByNum() in the attached as a PoC (first > attached) as food for discussion. > > With the second attached script, we can observe what is happening > from another session by the following query.
> > select relname, size, ntuples, ageclass from pg_stat_syscache where relname = 'pg_statistic'::regclass; > > > pg_statistic | 1041024 | 7109 | {{1,1109},{3,0},{30,0},{60,0},{90,6000},{0,0 > > On the other hand, unlike the original pruning, this > happens independently of hash resizing, so it will cause another > observable intermittent slowdown in addition to rehashing. > > The two should have the same extent of impact on performance when > disabled. I'll take numbers briefly using pgbench. Sorry, I forgot to consider references in the previous patch, and to attach the test script. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 952497a1fad57ac49e0b772a147201aa31065183 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. This patch adds consideration of removing entries that haven't been used for a certain time before enlarging the hash array.
--- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 164 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 252 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 9b7a7388d5..d0d2374944 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which a syscache can grow + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep a + certain number of intermittently used syscache entries, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum time in seconds for which a syscache entry must + remain unused before it can be removed. -1 disables syscache + pruning entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that are not + used for this duration can be removed to prevent syscache bloat.
This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 92bda87804..ddc433c59e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -734,7 +734,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..769e173844 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum hash size at which to consider entry + * eviction. This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable to define the minimum age of entries that will be considered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,6 +857,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -858,9 +875,127 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if not yet done */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent the catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element of nentries stores the number of cache entries that + * have lived unaccessed for the corresponding ageclass multiple of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which takes from 0 up to 2.
+ */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries not accessed since the last pruning are removed after + * that duration, while entries that have been accessed several + * times are left alone for up to three times that duration before + * removal. We don't try to shrink buckets since pruning + * effectively caps catcache expansion in the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else if (ct->refcount == 0 && + (!ct->c_list || ct->c_list->refcount == 0)) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1409,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1819,11 +1959,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; Assert(!negative); @@ -1842,13 +1983,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2018,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1898,17 +2040,24 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry.
If it + * failed, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); return ct; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8681ada33a..06c589f725 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that stay unused for longer than this many seconds are considered for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c7f53470df..108d332f2c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..5d24809900 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total size of catcache tuples in bytes */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 19f9ec0a86f9d0a86e54a39188dd8e75a7d8061a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/4] Syscache usage tracking feature. Collects syscache usage statistics and shows them using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 +++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 576 insertions(+), 45 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d0d2374944..5ff3ebeb4e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6687,6 +6687,21 @@ COPY
postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval to collect system cache usage statistics in + milliseconds. This parameter is 0 by default, which means disabled. + Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT 
pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..a1939958b7 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * clean up stats files in specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts files to remove.
*/ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern. */ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after.
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* we don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled; remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; tell the caller how long to wait */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold off + * interrupts to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..fb77a0ce4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..6526cfefb4 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 769e173844..1da1589a5d 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which ranges from 0 to 2. */ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is a + * time-consuming task during catcache lookup, but acceptable since we + * are about to expand the hash table anyway. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -936,21 +941,21 @@ /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned.
+ * Calculate the duration from the last access to the + * "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. */ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -981,14 +986,17 @@ } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age classes must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1365,9 +1373,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1427,9 +1433,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1438,9 +1442,7 @@
CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1568,9 +1570,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1681,9 +1681,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1740,9 +1738,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2250,3 +2246,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats and fills in most of the + * result. The classification here uses the same criteria as + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed within a session, so fill this in every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding multiple in ageclass + * of cache_prune_min_age.
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory.
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index c0b6231458..dee7f19475 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 06c589f725..32e41253a6 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 108d332f2c..4d4fb42251 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -560,6 +560,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index b8de13f03b..6099a828d2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname =>
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5d24809900..4d51975920 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for 
catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples falling into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass
AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From 83444ebafff25babd94c48080b5ba420a27db430 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 6 Feb 2019 14:36:29 +0900 Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature Adds prune based on the number of cache entries on top of the current pruning patch. It is controlled by two GUC variables. 
cache_entry_limit: limit of the number of entries per catcache cache_entry_limit_prune_ratio: how much of entries to remove at pruning --- src/backend/utils/cache/catcache.c | 107 ++++++++++++++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 40 ++++++++++++++ src/include/utils/catcache.h | 2 + 3 files changed, 148 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 1da1589a5d..70ae5da988 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -77,6 +77,11 @@ */ int cache_memory_target = 0; + +/* PoC entry limit */ +int cache_entry_limit = 0; +double cache_entry_limit_prune_ratio = 0.8; + /* GUC variable to define the minimum age of entries that will be considered to * be evicted in seconds. This variable is shared among various cache * mechanisms. @@ -882,6 +887,102 @@ InitCatCache(int id, return cp; } +/* + * CatCacheCleanupOldEntriesByNum - + * PoC: remove infrequently-used entries by number of entries. + */ +static bool +CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit) +{ + int i; + int n; + int oldndelelem = cp->cc_ntup; + int ndelelem; + CatCTup **ct_array; + + ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio); + + /* lower limit: quite arbitrary */ + if (ndelelem < 256) + ndelelem = 256; + + /* + * partial sort (bounded max-heap) array: [0] contains the entry with + * the latest access among the collected candidates + */ + ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*)); + n = 0; + + /* + * Collect the entries to be removed, i.e. those with the oldest + * lastaccess, using a bounded heapsort like tuplesort.c.
+ */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + if (n < ndelelem) + { + /* Fill up the bounded max-heap array */ + int j = n++; + + while (j > 0) + { + int i = (j - 1) >> 1; + + if (ct->lastaccess <= ct_array[i]->lastaccess) + break; + ct_array[j] = ct_array[i]; + j = i; + } + ct_array[j] = ct; + } + else if (ct->lastaccess < ct_array[0]->lastaccess) + { + /* older than the newest collected entry, replace the root */ + unsigned int i; + + i = 0; + + for (;;) + { + unsigned int j = 2 * i + 1; + + if (j >= n) + break; + if (j + 1 < n && + ct_array[j]->lastaccess < ct_array[j + 1]->lastaccess) + j++; + if (ct->lastaccess >= ct_array[j]->lastaccess) + break; + ct_array[i] = ct_array[j]; + i = j; + } + ct_array[i] = ct; + } + } + } + + /* Now we have the list of elements to be deleted */ + for (i = 0 ; i < n && ct_array[i]; i++) + CatCacheRemoveCTup(cp, ct_array[i]); + + pfree(ct_array); + + elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup); + + return true; +} + /* * CatCacheCleanupOldEntries - Remove infrequently-used entries * @@ -923,7 +1024,7 @@ CatCacheCleanupOldEntries(CatCache *cp) hash_size = cp->cc_nbuckets * sizeof(dlist_head); if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - + /* * Search the whole hash for entries to remove.
This is quite a time * consuming task during catcache lookup, but acceptable since now we are @@ -2047,6 +2148,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CacheHdr->ch_ntup++; cache->cc_tupsize += tupsize; + /* cap number of entries */ + if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit) + CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit); + /* * If the hash table has become too full, try cleanup by removing * infrequently used entries to make room for the new entry. If it diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 32e41253a6..7bb239a07e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2227,6 +2227,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of entries per catcache."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if @@ -3401,6 +3431,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the ratio of catcache entries removed when the entry limit is exceeded."), + NULL + }, + &cache_entry_limit_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 4d51975920..1f7fb51ac0 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext; /* for guc.c, not PGDLLIMPORT'ed */ extern int cache_prune_min_age; extern int cache_memory_target; +extern int cache_entry_limit; +extern double cache_entry_limit_prune_ratio; /* to use as access timestamp of catcache entries */ extern TimestampTz catcacheclock; -- 2.16.3 #! /usr/bin/perl print "set track_syscache_usage_interval to 1000;\n"; ## for time-based pruning #print "set cache_prune_min_age to '5s';\n"; #print "set cache_memory_target to '0';\n"; ## for limit-based pruning print "set cache_memory_target to '100MB';\n"; print "set cache_entry_limit to 10000;\n"; print "set cache_entry_limit_prune_ratio to 0.6;\n"; while (1) { print "begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on commit drop;insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit;\n"; }
At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp> > > The two should have the same extent of impact on performance when > > disabled. I'll take numbers briefly using pgbench. (pgbench -j 10 -c 10 -T 120) x 5 times for each. A: unpatched : 118.58 tps (stddev 0.44) B: patched-not-used[1] : 118.41 tps (stddev 0.29) C: patched-timedprune[2]: 118.41 tps (stddev 0.51) D: patched-capped...... : none[3] [1]: cache_prune_min_age = 0, cache_entry_limit = 0 [2]: cache_prune_min_age = 100, cache_entry_limit = 0 (Prunes every 100ms) [3] I didn't find a sane benchmark for the capping case using vanilla pgbench. It doesn't seem to show significant degradation on *my* box... # I found a bug that can remove a newly created entry. So v11. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 288f499393a1b6dd8c37781205fd7e553974fa1d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 1/4] Remove entries that haven't been used for a certain time Catcache entries can be left unused for long periods for several reasons, and it is not desirable that they eat up memory. With this patch, entries that haven't been used for a certain time are considered for removal before enlarging the hash array.
--- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 168 ++++++++++++++++++++++++-- src/backend/utils/misc/guc.c | 23 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 28 ++++- 6 files changed, 256 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 9b7a7388d5..d0d2374944 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + a certain number of intermittently used syscache entries, try + increasing this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum time in seconds that a syscache entry must + remain unused before it is considered for removal. -1 disables + syscache pruning entirely. The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that are not + used for this duration can be removed to prevent syscache bloat.
This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 92bda87804..ddc433c59e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -734,7 +734,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..5106ed896a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,24 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable defining the minimum hash size above which entry eviction is + * considered. This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + +/* GUC variable defining the minimum age, in seconds, of entries that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -490,6 +505,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,6 +857,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; /* * new cache is initialized as far as we can go for now. print some @@ -858,9 +875,127 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int i; + int nremoved = 0; + size_t hash_size; +#ifdef CATCACHE_STATS + /* These variables are only for debugging purposes */ + int ntotal = 0; + /* + * The nth element in nentries stores the number of cache entries that + * have lived unaccessed for the corresponding multiple in ageclass of + * cache_prune_min_age. The index of nremoved_entry is the value of the + * clock-sweep counter, which ranges from 0 to 2.
+ */ + double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + int nentries[] = {0, 0, 0, 0, 0, 0}; + int nremoved_entry[3] = {0, 0, 0}; + int j; +#endif + + /* Return immediately if no pruning is wanted */ + if (cache_prune_min_age < 0) + return false; + + /* + * Return without pruning if the size of the hash is below the target. + */ + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) + return false; + + /* Search the whole hash for entries to remove */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + +#ifdef CATCACHE_STATS + /* count catcache entries for each age class */ + ntotal++; + for (j = 0 ; + ageclass[j] != 0.0 && + entry_age > cache_prune_min_age * ageclass[j] ; + j++); + if (ageclass[j] == 0.0) j--; + nentries[j]++; +#endif + + /* + * Try to remove entries older than cache_prune_min_age seconds. + * Entries that have not been accessed since the last pruning are + * removed after that many seconds, while entries that have been + * accessed several times are removed only after being left unused + * for up to three times that duration. We don't try to shrink the + * buckets since pruning effectively caps catcache expansion in + * the long term.
+ */ + if (entry_age > cache_prune_min_age) + { +#ifdef CATCACHE_STATS + Assert (ct->naccess >= 0 && ct->naccess <= 2); + nremoved_entry[ct->naccess]++; +#endif + if (ct->naccess > 0) + ct->naccess--; + else if (ct->refcount == 0 && + (!ct->c_list || ct->c_list->refcount == 0)) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + } + +#ifdef CATCACHE_STATS + ereport(DEBUG1, + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + nremoved, ntotal, + ageclass[0] * cache_prune_min_age, nentries[0], + ageclass[1] * cache_prune_min_age, nentries[1], + ageclass[2] * cache_prune_min_age, nentries[2], + ageclass[3] * cache_prune_min_age, nentries[3], + ageclass[4] * cache_prune_min_age, nentries[4], + nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), + errhidestmt(true))); +#endif + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1409,11 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1819,11 +1959,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; Assert(!negative); @@ -1842,13 +1984,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2019,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1898,19 +2041,30 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; + /* increase refcount so that this survives pruning */ + ct->refcount++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2.
+ * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8681ada33a..06c589f725 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2204,6 +2205,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that stay unused for longer than this many seconds become candidates for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c7f53470df..108d332f2c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..5d24809900 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + int cc_tupsize; /* total size of catcache tuples in bytes */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +193,28 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 1ee885cef5cc66a1246e4929954cdcc1949f162a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 2/4] Syscache usage tracking feature. Collects syscache usage statistics and shows them using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval. --- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 115 +++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 576 insertions(+), 45 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d0d2374944..5ff3ebeb4e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6687,6 +6687,21 @@ COPY
postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval to collect system cache usage statistics in + milliseconds. This parameter is 0 by default, which means disabled. + Only superusers can change this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT 
pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..a1939958b7 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; /* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify which stats file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts the files to remove.
*/ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern. */ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after.
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled. remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; report the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold interrupts + * to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..fb77a0ce4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..6526cfefb4 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 5106ed896a..950576fea0 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -89,6 +89,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -619,9 +623,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -697,9 +699,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -906,10 +906,11 @@ CatCacheCleanupOldEntries(CatCache *cp) * cache_prune_min_age. The index of nremoved_entry is the value of the * clock-sweep counter, which takes from 0 up to 2. */ - double ageclass[] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; - int nentries[] = {0, 0, 0, 0, 0, 0}; + int nentries[SYSCACHE_STATS_NAGECLASSES] = {0, 0, 0, 0, 0, 0}; int nremoved_entry[3] = {0, 0, 0}; int j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); #endif /* Return immediately if no pruning is wanted */ @@ -923,7 +924,11 @@ CatCacheCleanupOldEntries(CatCache *cp) if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - /* Search the whole hash for entries to remove */ + /* + * Search the whole hash for entries to remove. This is a quite time + * consuming task during catcache lookup, but accetable since now we are + * going to expand the hash table. + */ for (i = 0; i < cp->cc_nbuckets; i++) { dlist_mutable_iter iter; @@ -936,21 +941,21 @@ CatCacheCleanupOldEntries(CatCache *cp) /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. 
+ * Calculate the duration from the time from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. */ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); #ifdef CATCACHE_STATS /* count catcache entries for each age class */ ntotal++; - for (j = 0 ; - ageclass[j] != 0.0 && - entry_age > cache_prune_min_age * ageclass[j] ; - j++); - if (ageclass[j] == 0.0) j--; + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > cache_prune_min_age * ageclass[j]) + j++; nentries[j]++; #endif @@ -981,14 +986,17 @@ CatCacheCleanupOldEntries(CatCache *cp) } #ifdef CATCACHE_STATS + StaticAssertStmt(SYSCACHE_STATS_NAGECLASSES == 6, + "number of syscache age class must be 6"); ereport(DEBUG1, - (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d) naccessed(0:%d, 1:%d, 2:%d)", + (errmsg ("removed %d/%d, age(-%.0fs:%d, -%.0fs:%d, *-%.0fs:%d, -%.0fs:%d, -%.0fs:%d, rest:%d) naccessed(0:%d,1:%d, 2:%d)", nremoved, ntotal, ageclass[0] * cache_prune_min_age, nentries[0], ageclass[1] * cache_prune_min_age, nentries[1], ageclass[2] * cache_prune_min_age, nentries[2], ageclass[3] * cache_prune_min_age, nentries[3], ageclass[4] * cache_prune_min_age, nentries[4], + nentries[5], nremoved_entry[0], nremoved_entry[1], nremoved_entry[2]), errhidestmt(true))); #endif @@ -1365,9 +1373,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1427,9 +1433,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1438,9 +1442,7 @@ SearchCatCacheInternal(CatCache *cache, 
CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1568,9 +1570,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1681,9 +1681,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1740,9 +1738,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2254,3 +2250,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed on-session, fill it every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * nth element in nclass_entries stores the number of cache entries that + * have lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age. 
+ */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. 
+ */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index c0b6231458..dee7f19475 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ 
IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 06c589f725..32e41253a6 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3168,6 +3168,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. Zero disables syscache usagetracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 108d332f2c..4d4fb42251 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -560,6 +560,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index b8de13f03b..6099a828d2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 
'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool 
permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5d24809900..4d51975920 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -65,10 +65,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +79,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -254,4 +251,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for 
catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples resides in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass 
AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From c1a947892dd3f96cc6200c4a27b0c8a24d1c3469 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 6 Feb 2019 14:36:29 +0900 Subject: [PATCH 3/4] PoC add prune-by-number-of-entries feature Adds prune based on the number of cache entries on top of the current pruning patch. It is controlled by two GUC variables. 
cache_entry_limit: limit of the number of entries per catcache cache_entry_limit_prune_ratio: how much of entries to remove at pruning --- src/backend/utils/cache/catcache.c | 108 ++++++++++++++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 40 ++++++++++++++ src/include/utils/catcache.h | 2 + 3 files changed, 149 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 950576fea0..ecea5b603c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -77,6 +77,11 @@ */ int cache_memory_target = 0; + +/* PoC entry limit */ +int cache_entry_limit = 0; +double cache_entry_limit_prune_ratio = 0.8; + /* GUC variable to define the minimum age of entries that will be cosidered to * be evicted in seconds. This variable is shared among various cache * mechanisms. @@ -882,6 +887,102 @@ InitCatCache(int id, return cp; } +/* + * CatCacheCleanupOldEntriesByNum - + * Poc remove infrequently-used entries by number of entries. + */ +static bool +CatCacheCleanupOldEntriesByNum(CatCache *cp, int cache_entry_limit) +{ + int i; + int n; + int oldndelelem = cp->cc_ntup; + int ndelelem; + CatCTup **ct_array; + + ndelelem = oldndelelem - (int)(cache_entry_limit * cache_entry_limit_prune_ratio); + + /* lower limit: quite arbitrary */ + if (ndelelem < 256) + ndelelem = 256; + + /* + * partial sort array: [0] contains latest access entry + * [1] contains ealiest access entry + */ + ct_array = (CatCTup **) palloc(ndelelem * sizeof(CatCTup*)); + n = 0; + + /* + * Collect entries to be removed, which have older lastaccess. + * Using heap bound sort like tuplesort.c. 
+ */ + for (i = 0; i < cp->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + if (n < ndelelem) + { + /* Fill up the min heap array */ + int j = n++; + + while (j > 0) + { + int i = (j - 1) >> 1; + + if (ct->lastaccess >= ct_array[i]->lastaccess) + break; + ct_array[j] = ct_array[i]; + j = i; + } + ct_array[j] = ct; + } + else if (ct->lastaccess > ct_array[0]->lastaccess) + { + /* older than the oldest in the array, add it */ + unsigned int i; + + i = 0; + + for (;;) + { + unsigned int j = 2 * i + 1; + + if (j >= n) + break; + if (j + 1 < n && + ct_array[j]->lastaccess > ct_array[j + 1]->lastaccess) + j++; + if (ct->lastaccess <= ct_array[j]->lastaccess) + break; + ct_array[i] = ct_array[j]; + i = j; + } + ct_array[i] = ct; + } + } + } + + /* Now we have the list of elements to be deleted */ + for (i = 0 ; i < n && ct_array[i]; i++) + CatCacheRemoveCTup(cp, ct_array[i]); + + pfree(ct_array); + + elog(LOG, "Catcache pruned by entry number: id=%d, %d => %d", cp->id, oldndelelem, cp->cc_ntup); + + return true; +} + /* * CatCacheCleanupOldEntries - Remove infrequently-used entries * @@ -923,7 +1024,7 @@ CatCacheCleanupOldEntries(CatCache *cp) hash_size = cp->cc_nbuckets * sizeof(dlist_head); if (hash_size + cp->cc_tupsize < (Size) cache_memory_target * 1024L) return false; - + /* * Search the whole hash for entries to remove. 
This is a quite time * consuming task during catcache lookup, but accetable since now we are @@ -2049,6 +2150,11 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* increase refcount so that this survives pruning */ ct->refcount++; + + /* cap number of entries */ + if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit) + CatCacheCleanupOldEntriesByNum(cache, cache_entry_limit); + /* * If the hash table has become too full, try cleanup by removing * infrequently used entries to make a room for the new entry. If it diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 32e41253a6..7bb239a07e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2227,6 +2227,36 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum entries of catcache."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum entries of catcache."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum entries of catcache."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if @@ -3401,6 +3431,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum entries of catcache."), + NULL + }, + &cache_entry_limit_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 4d51975920..1f7fb51ac0 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -193,6 +193,8 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext; /* for guc.c, not PGDLLPMPORT'ed */ extern int cache_prune_min_age; extern int cache_memory_target; +extern int cache_entry_limit; +extern double cache_entry_limit_prune_ratio; /* to use as access timestamp of catcache entries */ extern TimestampTz catcacheclock; -- 2.16.3
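The PoC patch above collects eviction candidates with a bounded heap ("heap bound sort like tuplesort.c"): every entry is scanned once, but at most K candidates are kept, so the selection costs O(n log K) instead of a full sort. A standalone sketch of that selection — keeping the K oldest access timestamps in a max-heap whose root is the newest kept candidate — might look like this (names and types here are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t access_ts;      /* stand-in for TimestampTz */

/* Restore max-heap order downward from position i (root holds the newest). */
static void
sift_down(access_ts *heap, int n, int i)
{
    for (;;)
    {
        int j = 2 * i + 1;

        if (j >= n)
            break;
        if (j + 1 < n && heap[j + 1] > heap[j])
            j++;                /* descend toward the newer child */
        if (heap[i] >= heap[j])
            break;
        access_ts tmp = heap[i];
        heap[i] = heap[j];
        heap[j] = tmp;
        i = j;
    }
}

/*
 * Scan vals[] once and leave the (at most) 'limit' oldest values in heap[].
 * Returns the number of candidates collected.
 */
static int
collect_oldest(const access_ts *vals, int nvals, access_ts *heap, int limit)
{
    int n = 0;

    for (int k = 0; k < nvals; k++)
    {
        if (n < limit)
        {
            /* fill phase: insert and sift up to keep max-heap order */
            int j = n++;

            heap[j] = vals[k];
            while (j > 0)
            {
                int parent = (j - 1) / 2;

                if (heap[parent] >= heap[j])
                    break;
                access_ts tmp = heap[parent];
                heap[parent] = heap[j];
                heap[j] = tmp;
                j = parent;
            }
        }
        else if (vals[k] < heap[0])
        {
            /* older than the newest kept candidate: replace the root */
            heap[0] = vals[k];
            sift_down(heap, n, 0);
        }
    }
    return n;
}
```

The direction of the comparisons is the easy thing to get wrong here: with a max-heap of the oldest values, a new value may enter only when it is older than the root; flipping either test makes the scan silently keep the newest entries instead of the oldest.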
On 2019-02-06 17:37:04 +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 06 Feb 2019 15:16:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190206.151653.117382256.horiguchi.kyotaro@lab.ntt.co.jp>
> > > The two should have the same extent of impact on performance when
> > > disabled. I'll take numbers briefly using pgbench.
>
> (pgbench -j 10 -c 10 -T 120) x 5 times for each.
>
> A: unpatched            : 118.58 tps (stddev 0.44)
> B: patched-not-used[1]  : 118.41 tps (stddev 0.29)
> C: patched-timedprune[2]: 118.41 tps (stddev 0.51)
> D: patched-capped...... : none[3]
>
> [1]: cache_prune_min_age = 0, cache_entry_limit = 0
>
> [2]: cache_prune_min_age = 100, cache_entry_limit = 0
>      (Prunes every 100ms)
>
> [3] I didn't find a sane benchmark for the capping case using
>     vanilla pgbench.
>
> It doesn't seem to show significant degradation on *my* box...
>
> # I found a bug that can remove a newly created entry. So v11.

This seems to just benchmark your disk speed, no? ISTM you need to measure read-only performance, not read/write. And with plenty more tables than just standard pgbench -S.

Greetings,

Andres Freund
At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190205220526.GA1442@alvherre.pgsql>
> On 2019-Feb-05, Tomas Vondra wrote:
>
> > I don't think we need to remove the expired entries right away, if there
> > are only very few of them. The cleanup requires walking the hash table,
> > which means significant fixed cost. So if there are only few expired
> > entries (say, less than 25% of the cache), we can just leave them around
> > and clean them if we happen to stumble on them (although that may not be
> > possible with dynahash, which has no concept of expiration) or before
> > enlarging the hash table.
>
> I think seqscanning the hash table is going to be too slow; Ideriha-san's
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries. Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.

Ah, I had a separate list in my mind. It sounds reasonable to keep the pointers in the cache entry, but I'm not sure how much impact the additional dlist_* calls have.

The attached is the new version with the following properties:

- Both the prune-by-age and hard-limiting features. (Merged into a single function, single scan.)

  The debug tracking feature in CatCacheCleanupOldEntries is removed since it no longer runs a full scan.

  Prune-by-age can be a single-setting-for-all-caches feature but the hard limit is obviously not. We could use reloptions for the purpose (which are not currently available on pg_class and pg_attribute:p). I'll add that if there's no strong objection. Or can anyone come up with something more suitable for the purpose?

- Using an LRU list to get rid of the full scan.

  I added the new API dlist_move_tail, which was needed to construct the LRU list.

I'm going to retake numbers with search-only queries.
> > So if we want to address this case too (and we probably want), we may
> > need to discard the old cache memory context somehow (e.g. rebuild the
> > cache in a new one, and copy the non-expired entries). Which is a nice
> > opportunity to do the "full" cleanup, of course.
>
> Yeah, we probably don't want to do this super frequently though.

MemoryContext per cache?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 72a569703662b93fb57c55c337b16107ebccfce3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/4] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail.
Add it.
---
 src/include/lib/ilist.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
 	dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+	/* fast path if it's already at the tail */
+	if (head->head.prev == node)
+		return;
+
+	dlist_delete(node);
+	dlist_push_tail(head, node);
+
+	dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From 5919f1495f27faefdc09abe65fd6e374fa83d9ff Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 16 Oct 2018 13:04:30 +0900
Subject: [PATCH 2/4] Remove entries that haven't been used for a certain time

Catcache entries can be left alone for several reasons. It is not
desirable that they eat up memory.
With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. This also can put a hard limit on the number of catcache entries. --- doc/src/sgml/config.sgml | 38 ++++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 190 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 63 +++++++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 32 ++++- 6 files changed, 322 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 9b7a7388d5..d0d2374944 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + certain amount of syscache entries with intermittent usage, try + increase this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. 
The value defaults to 600 seconds + (<literal>10 minutes</literal>). Syscache entries that are not + used for this duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 92bda87804..ddc433c59e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -734,7 +734,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as the approximate current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..0a56390352 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,32 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum hash size above which entry eviction is + * considered. This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + + +/* + * GUC for entry limit. Entries are removed when their number goes above + * cache_entry_limit, down to the ratio specified by cache_entry_limit_prune_ratio. + */ +int cache_entry_limit = 0; +double cache_entry_limit_prune_ratio = 0.8; + +/* GUC variable to define the minimum age of entries, in seconds, that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -481,6 +504,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -490,6 +514,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,7 +866,9 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -858,9 +885,133 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. Eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had no access in the same duration.
+ */ +#define PRUNE_BY_AGE 0x01 +#define PRUNE_BY_NUMBER 0x02 + +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + size_t hash_size; + int nelems_before = cp->cc_ntup; + int ndelelems = 0; + int action = 0; + dlist_mutable_iter iter; + + if (cache_prune_min_age >= 0) + { + /* prune only if the size of the hash is above the target */ + + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize > (Size) cache_memory_target * 1024L) + action |= PRUNE_BY_AGE; + } + + if (cache_entry_limit > 0 && nelems_before >= cache_entry_limit) + { + ndelelems = nelems_before - + (int) (cache_entry_limit * cache_entry_limit_prune_ratio); + + if (ndelelems < 256) + ndelelems = 256; + if (ndelelems > nelems_before) + ndelelems = nelems_before; + + action |= PRUNE_BY_NUMBER; + } + + /* Return immediately if no pruning is wanted */ + if (action == 0) + return false; + + /* Scan over the LRU list to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + bool remove_this = false; + + /* We don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* check against age */ + if (action & PRUNE_BY_AGE) + { + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < cache_prune_min_age) + { + /* remaining entries are all newer than the threshold, exit */ + action &= ~PRUNE_BY_AGE; + break; + } + + /* + * Entries that have not been accessed since the last pruning are + * removed after cache_prune_min_age seconds, while entries that + * have been accessed several times are left alone for up to + * three times that duration. We don't try to shrink the buckets + * since pruning effectively caps catcache expansion in the long + * term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + remove_this = true; + } + + /* check against entry number */ + if (action & PRUNE_BY_NUMBER) + { + if (nremoved < ndelelems) + remove_this = true; + else + action &= ~PRUNE_BY_NUMBER; /* satisfied */ + } + + /* exit if finished */ + if (action == 0) + break; + + /* do the work */ + if (remove_this) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, nelems_before); + + return nremoved > 0; +} /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1425,12 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller.
@@ -1819,11 +1976,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; Assert(!negative); @@ -1842,13 +2001,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2036,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1898,18 +2058,34 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; + + /* increase refcount so that this survives pruning */ + ct->refcount++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2.
+ * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + /* we may still want to prune by entry number, check it */ + else if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit) + CatCacheCleanupOldEntries(cache); + + ct->refcount--; return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8681ada33a..d4df841982 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2204,6 +2205,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned before exceeding this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries that stay unused for longer than this many seconds are candidates for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of catcache entries."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if @@ -3368,6 +3421,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the ratio of catcache entries to retain when pruning by the entry limit."), + NULL + }, + &cache_entry_limit_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c7f53470df..108d332f2c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..973a87c2cf 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,8 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; + int cc_tupsize; /* total amount
of memory used by catcache tuples */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +122,10 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +195,30 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; +extern int cache_entry_limit; +extern double cache_entry_limit_prune_ratio; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 2591d5984d5fb9f2fd4cca0ecb8c68431311790a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 3/4] Syscache usage tracking feature. Collects syscache usage statistics and shows them using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval.
--- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 89 +++++++++--- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 559 insertions(+), 36 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d0d2374944..5ff3ebeb4e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval to collect system cache usage statistics in + milliseconds. This parameter is 0 by default, which means disabled. + Only superusers can change this setting. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..a1939958b7 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; 
/* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify which stats file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts the files to remove. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after.
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ } + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled. remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; report the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold interrupts + * to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..fb77a0ce4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..6526cfefb4 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + fread(&cacheid, sizeof(int), 1, fpin); + fread(&last_update, sizeof(TimestampTz), 1, fpin); + if (fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 0a56390352..bdcc10064f 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -97,6 +97,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -628,9 +632,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -706,9 +708,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -958,10 +958,10 @@ CatCacheCleanupOldEntries(CatCache *cp) int us; /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); @@ -1381,9 +1381,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1444,9 +1442,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1455,9 +1451,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1585,9 +1579,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1698,9 +1690,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1757,9 +1747,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2276,3 +2264,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). 
+ */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed on-session, fill it every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * nth element in nclass_entries stores the number of cache entries that + * have lived unaccessed for corresponding multiple in ageclass of + * cache_prune_min_age. + */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. 
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. + */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index c0b6231458..dee7f19475 100644 --- a/src/backend/utils/init/postinit.c +++ 
b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index d4df841982..7bb239a07e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3198,6 +3198,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. 
Zero disables syscache usagetracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 108d332f2c..4d4fb42251 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -560,6 +560,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index b8de13f03b..6099a828d2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 --- 
a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 973a87c2cf..85fa7bdb86 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -66,10 +66,8 @@ typedef struct catcache int cc_tupsize; /* total amount of catcache tuples 
*/ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -82,7 +80,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -258,4 +255,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples resides in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 
hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, 
p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3
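The CatCacheGetStats() code in the patch above classifies each entry by how long it has been unaccessed, using thresholds computed as cache_prune_min_age times the ageclass multipliers {0.05, 0.1, 1.0, 2.0, 3.0, 0.0} (the trailing 0.0 marks the catch-all class). A minimal standalone sketch of that bucketing, with illustrative names rather than the patch's actual API:

```c
#include <assert.h>

#define NAGECLASSES 6

/* Multipliers of cache_prune_min_age, as in the patch; the trailing
 * 0.0 is a sentinel for the catch-all (oldest) class. */
static const double ageclass[NAGECLASSES] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};

/*
 * Return the age-class index for an entry that has been unaccessed for
 * entry_age seconds, given cache_prune_min_age (in seconds).  Mirrors
 * the classification loop in CatCacheGetStats().
 */
static int
age_class_of(long entry_age, int cache_prune_min_age)
{
	int			thresholds[NAGECLASSES];
	int			j;

	/* thresholds in seconds; recomputed because the GUC can change */
	for (j = 0; j < NAGECLASSES; j++)
		thresholds[j] = (int) (cache_prune_min_age * ageclass[j]);

	/* advance past every threshold the entry's age exceeds */
	j = 0;
	while (j < NAGECLASSES - 1 && entry_age > thresholds[j])
		j++;
	return j;
}
```

With the default cache_prune_min_age of 600 s the thresholds are 30, 60, 600, 1200 and 1800 seconds, so e.g. an entry idle for 45 s lands in class 1 and anything idle longer than 1800 s falls into the final catch-all class.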
Hi, thanks for the recent rapid work. >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] >At Tue, 5 Feb 2019 19:05:26 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> >wrote in <20190205220526.GA1442@alvherre.pgsql> >> On 2019-Feb-05, Tomas Vondra wrote: >> >> > I don't think we need to remove the expired entries right away, if >> > there are only very few of them. The cleanup requires walking the >> > hash table, which means significant fixed cost. So if there are only >> > few expired entries (say, less than 25% of the cache), we can just >> > leave them around and clean them if we happen to stumble on them >> > (although that may not be possible with dynahash, which has no >> > concept of expiration) or before enlarging the hash table. >> >> I think seqscanning the hash table is going to be too slow; >> Ideriha-san idea of having a dlist with the entries in LRU order >> (where each entry is moved to head of list when it is touched) seemed >> good: it allows you to evict older ones when the time comes, without >> having to scan the rest of the entries. Having a dlist means two more >> pointers on each cache entry AFAIR, so it's not a huge amount of memory. > >Ah, I had a separate list in my mind. Sounds reasonable to have pointers in cache entry. >But I'm not sure how much additional >dlist_* impact. Thank you for picking up my comment, Alvaro. That's what I was thinking about. >The attached is the new version with the following properties: > >- Both prune-by-age and hard limiting feature. > (Merged into single function, single scan) > Debug tracking feature in CatCacheCleanupOldEntries is removed > since it no longer runs a full scan. It seems to me that adding a hard-limit strategy alongside the prune-by-age one is a good way to cover the variety of (contradictory) cases that have been discussed in this thread. I need the hard limit as well.
The hard limit is currently expressed as a number of cache entries, controlled by both cache_entry_limit and cache_entry_limit_prune_ratio. Why don't we change it to an amount of memory (bytes)? An amount of memory is a more direct parameter for a customer who wants to set a hard limit, and it is easier to tune than a number of cache entries. >- Using LRU to get rid of full scan. > >I added new API dlist_move_to_tail which was needed to construct LRU. I had just thought that, since dlist_move_head() already exists, new entries could go on the head side and old ones on the tail side. But that's not an objection to adding the new API: depending on the situation, either moving to head or moving to tail can make for more readable code. Regards, Takeshi Ideriha
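The LRU idea discussed above — keeping cache entries on an intrusive doubly-linked list in access order, so that eviction walks only the cold end instead of seqscanning the whole hash — can be sketched in a few lines. This is an illustrative, self-contained miniature in the spirit of ilist.h, not the real dlist API; recently-used entries go to the tail, so the head end always holds the eviction candidates:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list with a sentinel head node. */
typedef struct node
{
	struct node *prev;
	struct node *next;
} node;

static void
list_init(node *head)
{
	head->prev = head->next = head;
}

static void
list_delete(node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void
list_push_tail(node *head, node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Analogue of the proposed dlist_move_tail(): bump an entry to the
 * most-recently-used (tail) end.  Undefined if n is not on the list.
 */
static void
list_move_tail(node *head, node *n)
{
	/* fast path if it's already at the tail */
	if (head->prev == n)
		return;
	list_delete(n);
	list_push_tail(head, n);
}
```

On a cache hit the entry is moved to the tail; pruning then iterates from head->next and stops as soon as it meets an entry younger than the age threshold, which is what removes the need for a full scan.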
At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp> > I'm going to retake numbers with search-only queries. Yeah, I was stupid. I reran the benchmark using "-S -T 30" on a server built without assertions, at -O2. The numbers are the best of three successive attempts. The patched version is running with cache_target_memory = 0, cache_prune_min_age = 600 and cache_entry_limit = 0, but pruning doesn't happen under this workload. master: 13393 tps v12 : 12625 tps (-6%) A significant degradation was found. Reducing the frequency of dlist_move_tail, by enforcing a 1ms interval between two successive updates of the same entry, made the degradation disappear. patched : 13720 tps (+2%) I think there's no need for even that update frequency; the interval is 100ms in the attached patch. # I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense.. The new version is attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 72a569703662b93fb57c55c337b16107ebccfce3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/4] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it. --- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list.
+ */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From 429001e7cbbb88710cfc5589bc46e2490f93d216 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 2/4] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. This also can put a hard limit on the number of catcache entries. --- doc/src/sgml/config.sgml | 38 +++++ src/backend/access/transam/xact.c | 5 + src/backend/utils/cache/catcache.c | 205 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 63 ++++++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/utils/catcache.h | 33 ++++- 6 files changed, 338 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 9b7a7388d5..d0d2374944 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1662,6 +1662,44 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-syscache-memory-target" xreflabel="syscache_memory_target"> + <term><varname>syscache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning. The value defaults to 0, indicating that pruning is + always considered. 
After exceeding this size, syscache pruning is + considered according to + <xref linkend="guc-syscache-prune-min-age"/>. If you need to keep + certain amount of syscache entries with intermittent usage, try + increase this setting. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-syscache-prune-min-age" xreflabel="syscache_prune_min_age"> + <term><varname>syscache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_prune_min_age</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + syscache entry is considered to be removed. -1 indicates that syscache + pruning is disabled at all. The value defaults to 600 seconds + (<literal>10 minutes</literal>). The syscache entries that are not + used for the duration can be removed to prevent syscache bloat. This + behavior is suppressed until the size of syscache exceeds + <xref linkend="guc-syscache-memory-target"/>. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 92bda87804..ddc433c59e 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -734,7 +734,12 @@ void SetCurrentStatementStartTimestamp(void) { if (!IsParallelWorker()) + { stmtStartTimestamp = GetCurrentTimestamp(); + + /* Set this timestamp as aproximated current time */ + SetCatCacheClock(stmtStartTimestamp); + } else Assert(stmtStartTimestamp != 0); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..c70ce3b745 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -71,9 +71,38 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* + * GUC variable to define the minimum size of hash to cosider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int cache_memory_target = 0; + + +/* + * GUC for entry limit. Entries are removed when the number of them goes above + * cache_entry_limit by the ratio specified by cache_entry_limit_prune_ratio + */ +int cache_entry_limit = 0; +double cache_entry_limit_prune_ratio = 0.8; + +/* GUC variable to define the minimum age of entries that will be cosidered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int cache_prune_min_age = 600; + +/* + * Ignorance interval between two success move of a cache entry in LRU list, + * in microseconds. + */ +#define LRU_IGNORANCE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Timestamp used for any operation on caches. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -481,6 +510,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -490,6 +520,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_tupsize -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,7 +872,9 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_tupsize = 0; + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -858,9 +891,133 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initilize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with the similar algorithm with buffer + * eviction using access counter. Entries that are accessed several times can + * live longer than those that have had no access in the same duration. 
+ */ +#define PRUNE_BY_AGE 0x01 +#define PRUNE_BY_NUMBER 0x02 + +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + size_t hash_size; + int nelems_before = cp->cc_ntup; + int ndelelems = 0; + int action = 0; + dlist_mutable_iter iter; + + if (cache_prune_min_age >= 0) + { + /* prune only if the size of the hash is above the target */ + + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_tupsize > (Size) cache_memory_target * 1024L) + action |= PRUNE_BY_AGE; + } + + if (cache_entry_limit > 0 && nelems_before >= cache_entry_limit) + { + ndelelems = nelems_before - + (int) (cache_entry_limit * cache_entry_limit_prune_ratio); + + if (ndelelems < 256) + ndelelems = 256; + if (ndelelems > nelems_before) + ndelelems = nelems_before; + + action |= PRUNE_BY_NUMBER; + } + + /* Return immediately if no pruning is wanted */ + if (action == 0) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + bool remove_this = false; + + /* We don't remove referenced entry */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* check against age */ + if (action & PRUNE_BY_AGE) + { + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < cache_prune_min_age) + { + /* no longer have a business with further entries, exit */ + action &= ~PRUNE_BY_AGE; + break; + } + + /* + * Entries that are not accessed after last pruning are removed in + * that seconds, and that has been accessed several times are + * removed after leaving alone for up to three times of the + * duration. 
We don't try shrink buckets since pruning effectively + * caps catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + remove_this = true; + } + + /* check against entry number */ + if (action & PRUNE_BY_NUMBER) + { + if (nremoved < ndelelems) + remove_this = true; + else + action &= ~PRUNE_BY_NUMBER; /* satisfied */ + } + + /* exit if finished */ + if (action == 0) + break; + + /* do the work */ + if (remove_this) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, nelems_before); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1274,6 +1431,21 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * We don't want too frequent update of LRU. cache_prune_min_age can + * be changed on-session so we need to maintan the LRU regardless of + * cache_prune_min_age. + */ + if (catcacheclock - ct->lastaccess > LRU_IGNORANCE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1819,11 +1991,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize = 0; /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1842,13 +2016,14 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; ct->tuple.t_data = (HeapTupleHeader) MAXALIGN(((char *) ct) + sizeof(CatCTup)); + ct->size = tupsize; /* copy tuple contents */ memcpy((char *) ct->tuple.t_data, (const char *) dtp->t_data, @@ -1876,8 +2051,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); - + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. @@ -1898,18 +2073,34 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); + ct->size = tupsize; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + cache->cc_tupsize += tupsize; + + /* increase refcount so that this survives pruning */ + ct->refcount++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. 
+ * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + /* we may still want to prune by entry number, check it */ + else if (cache_entry_limit > 0 && cache->cc_ntup > cache_entry_limit) + CatCacheCleanupOldEntries(cache); + + ct->refcount--; return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 8681ada33a..d4df841982 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2204,6 +2205,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Cache is not pruned while it is smaller than this size."), + GUC_UNIT_KB + }, + &cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Cache entries left unused for longer than this duration become candidates for removal."), + GUC_UNIT_S + }, + &cache_prune_min_age, + 600, -1, INT_MAX, + NULL, NULL, NULL + }, + + { + {"cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of catcache entries."), + NULL + }, + &cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if @@ -3368,6 +3401,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"cache_entry_limit_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the ratio of catcache entries removed when the entry limit is exceeded."), + NULL + }, + &cache_entry_limit_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index c7f53470df..108d332f2c 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#cache_memory_target = 0kB # in kB +#cache_prune_min_age = 600s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..3c6842e272 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,9 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; + int cc_tupsize; /* total amount
of catcache tuples */ + int cc_nfreeent; /* # of entries currently not referenced */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +123,10 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +196,30 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int cache_prune_min_age; +extern int cache_memory_target; +extern int cache_entry_limit; +extern double cache_entry_limit_prune_ratio; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * SetCatCacheClock - set timestamp for catcache access record + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 251607ff21981f840392387a28ca8f012ef18aab Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 15:48:28 +0900 Subject: [PATCH 3/4] Syscache usage tracking feature. Collects syscache usage statistics and shows them in the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval.
--- doc/src/sgml/config.sgml | 15 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 89 +++++++++--- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 559 insertions(+), 36 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d0d2374944..5ff3ebeb4e 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6687,6 +6687,21 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-syscache-usage-interval" xreflabel="track_syscache_usage_interval"> + <term><varname>track_syscache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_syscache_usage_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which system cache usage + statistics are collected. Zero, the default, disables collection. + Only superusers can change this setting.
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..a1939958b7 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; 
/* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts the files to remove. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit((unsigned char) entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after.
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled; remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; tell the caller the remaining time */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold off + * interrupts to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out stats for every catcache */ + for (cacheId = 0; cacheId < SysCacheSize; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert(stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void) fwrite(&cacheId, sizeof(int), 1, fpout); + (void) fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void) fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..fb77a0ce4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3157,6 +3157,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); } @@ -3733,6 +3739,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_catcache_update_timeout = false; /* Initialize startup process environment if necessary.
*/ if (!IsUnderPostmaster) @@ -4173,9 +4180,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_catcache_update_timeout = true; + enable_timeout_after(IDLE_CATCACHE_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4218,6 +4235,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_catcache_update_timeout) + { + disable_timeout(IDLE_CATCACHE_UPDATE_TIMEOUT, false); + disable_idle_catcache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..6526cfefb4 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid */ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid; + beid++); + + /* + * We silently return an empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j;
+ + if (fread(&cacheid, sizeof(int), 1, fpin) != 1 || + fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 || + fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert(i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for end of file; abandon the result if the file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c70ce3b745..484fe43e09 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -103,6 +103,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Timestamp used for any operation on caches.
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -634,9 +638,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -712,9 +714,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -964,10 +964,10 @@ CatCacheCleanupOldEntries(CatCache *cp) int us; /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get an age of 0.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); @@ -1387,9 +1387,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1459,9 +1457,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1470,9 +1466,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1600,9 +1594,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1713,9 +1705,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1772,9 +1762,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2291,3 +2279,64 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats; it fills in most of the + * result. The classification here is based on the same criteria as + * CatCacheCleanupOldEntries().
+ */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_tupsize + cache->cc_nbuckets * sizeof(dlist_head); + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* cache_prune_min_age can be changed within a session, so fill this in every time */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = (int) (cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding multiple of + * cache_prune_min_age in ageclass. + */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned.
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. + */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..f039ecd805 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index c0b6231458..dee7f19475 100644 --- a/src/backend/utils/init/postinit.c +++ 
b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(IDLE_CATCACHE_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1239,6 +1242,14 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index d4df841982..7bb239a07e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3198,6 +3198,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_syscache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. 
Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 108d332f2c..4d4fb42251 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -560,6 +560,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_syscache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index b8de13f03b..6099a828d2 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9669,6 +9669,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..69b9a976f0 100644 ---
a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 3c6842e272..9af414b307 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -67,10 +67,8 @@ typedef struct catcache int cc_nfreeent; /* # of entries currently not 
referenced */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -83,7 +81,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -259,4 +256,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples resides in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int 
cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..0ab441a364 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + IDLE_CATCACHE_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname 
AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3
>From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>I made a rerun of the benchmark using "-S -T 30" on the server built with no assertions and
>-O2. The numbers are the best of three successive attempts. The patched version is
>running with cache_target_memory = 0, cache_prune_min_age = 600 and
>cache_entry_limit = 0, but pruning doesn't happen with this workload.
>
>master: 13393 tps
>v12 : 12625 tps (-6%)
>
>Significant degradation is found.
>
>Reducing the frequency of dlist_move_tail by taking a 1ms interval between two
>successive updates on the same entry made the degradation disappear.
>
>patched : 13720 tps (+2%)

It would be good to introduce some interval. I followed your benchmark (initialized with scale factor = 10; the other options were the same) and found the same tendency. These numbers are averages of 5 trials.

master: 7640.000538
patch_v12: 7417.981378 (3% down from master)
patch_v13: 7645.071787 (almost the same as master)

As you mentioned, these are workloads where no pruning happens. I'd like to benchmark the cache-pruning case as well. To exercise the pruning path, right now I'm creating hundreds of partitioned tables and running a SELECT against each of them using a pgbench custom file. Maybe a small cache_prune_min_age or a hard limit would be better. Is there a good model for this?

># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..

How about MIN_LRU_UPDATE_INTERVAL?

Regards,
Takeshi Ideriha
From: Tomas Vondra
> I don't think we need to remove the expired entries right away, if there
> are only very few of them. The cleanup requires walking the hash table,
> which means significant fixed cost. So if there are only few expired
> entries (say, less than 25% of the cache), we can just leave them around
> and clean them if we happen to stumble on them (although that may not be
> possible with dynahash, which has no concept of expiration) or before
> enlarging the hash table.

I agree that we don't need to evict cache entries as long as the memory permits (within the control of the DBA.)

But how does the concept of expiration fit the catcache? How would the user determine the expiration time, i.e. the setting of syscache_prune_min_age? If you set a small value to evict unnecessary entries faster, necessary entries will also be evicted. Some access counter could keep frequently accessed entries longer, but an idle period (e.g. a lunch break) can flush entries that you want to access after the break.

The idea of expiration applies to cases where we want possibly stale entries to vanish and newer data to be loaded upon the next access, e.g. the TTL (time-to-live) of Memcached, Redis, DNS, ARP. Is the catcache based on the same idea as them? No.

What we want to do is to evict never or infrequently used cache entries. That's naturally the task of LRU, isn't it? Even the high-performance Memcached and Redis use LRU when the cache is full. As Bruce said, we don't have to be worried about lock contention or the like, because we're talking about a backend-local cache. Are we worried about the overhead of manipulating the LRU chain? The current catcache already does it on every access; it calls dlist_move_head() to put the accessed entry at the front of the hash bucket.

> So if we want to address this case too (and we probably want), we may
> need to discard the old cache memory context somehow (e.g. rebuild the
> cache in a new one, and copy the non-expired entries). Which is a nice
> opportunity to do the "full" cleanup, of course.

The straightforward, natural, and familiar way is to limit the cache size, which I mentioned in a previous mail. We should give the DBA the ability to control memory usage, rather than considering what to do after letting the memory area grow unnecessarily large. That's what a typical "cache" is, isn't it?

https://en.wikipedia.org/wiki/Cache_(computing)
"To be cost-effective and to enable efficient use of data, caches must be relatively small."

Another, admittedly suboptimal, idea would be to provide each catcache with a separate memory context that is a child of CacheMemoryContext. This gives a slight optimization by using the slab context (slab.c) for catcaches with fixed-sized tuples. But that'd be a bit complex for PG 12, I'm afraid.

Regards
MauMau
From: Alvaro Herrera
> I think seqscanning the hash table is going to be too slow; Ideriha-san's
> idea of having a dlist with the entries in LRU order (where each entry
> is moved to head of list when it is touched) seemed good: it allows you
> to evict older ones when the time comes, without having to scan the rest
> of the entries. Having a dlist means two more pointers on each cache
> entry AFAIR, so it's not a huge amount of memory.

Absolutely. We should try to avoid unpredictably long response times caused by an occasional unlucky batch of processing. That makes troubleshooting hard when the user asks why they experience unsteady response times.

Regards
MauMau
On 2/8/19 2:27 PM, MauMau wrote:
> From: Tomas Vondra
>> I don't think we need to remove the expired entries right away, if
>> there are only very few of them. The cleanup requires walking the
>> hash table, which means significant fixed cost. So if there are
>> only few expired entries (say, less than 25% of the cache), we can
>> just leave them around and clean them if we happen to stumble on
>> them (although that may not be possible with dynahash, which has no
>> concept of expiration) or before enlarging the hash table.
>
> I agree that we don't need to evict cache entries as long as the
> memory permits (within the control of the DBA.)
>
> But how does the concept of expiration fit the catcache? How would
> the user determine the expiration time, i.e. the setting of
> syscache_prune_min_age? If you set a small value to evict
> unnecessary entries faster, necessary entries will also be evicted.
> Some access counter would keep accessed entries longer, but some idle
> time (e.g. a lunch break) can flush entries that you want to access
> after the lunch break.

I'm not sure what you mean by "necessary" and "unnecessary" here. What matters is how often an entry is accessed - if it's accessed often, it makes sense to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are clearly not accessed very often, so getting rid of them will not hurt the cache hit ratio very much.

So I agree with Robert that a time-based approach should work well here. It does not have the issues of setting an exact syscache size limit, it's kinda self-adaptive, etc.

In a way, this is exactly what the 5 minute rule [1] says about caching.

[1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf

> The idea of expiration applies to the case where we want possibly
> stale entries to vanish and load newer data upon the next access.
> For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
> Is the catcache based on the same idea with them? No.
I'm not sure what this has to do with those other databases.

> What we want to do is to evict never or infrequently used cache
> entries. That's naturally the task of LRU, isn't it? Even the high
> performance Memcached and Redis use LRU when the cache is full. As
> Bruce said, we don't have to be worried about the lock contention or
> something, because we're talking about the backend local cache. Are
> we worried about the overhead of manipulating the LRU chain? The
> current catcache already does it on every access; it calls
> dlist_move_head() to put the accessed entry to the front of the hash
> bucket.

I'm certainly worried about the performance aspect of it. The syscache is in plenty of hot paths, so adding overhead may have significant impact. But that depends on how complex the eviction criteria will be.

And then there may be cases conflicting with the criteria, i.e. running into just-evicted entries much more often. This is the issue with the initially proposed hard limits on cache sizes, where it'd be trivial to under-size it just a little bit.

>> So if we want to address this case too (and we probably want), we
>> may need to discard the old cache memory context somehow (e.g.
>> rebuild the cache in a new one, and copy the non-expired entries).
>> Which is a nice opportunity to do the "full" cleanup, of course.
>
> The straightforward, natural, and familiar way is to limit the cache
> size, which I mentioned in some previous mail. We should give the
> DBA the ability to control memory usage, rather than considering what
> to do after letting the memory area grow unnecessarily large.
> That's what a typical "cache" is, isn't it?

Not sure which mail you're referring to - this seems to be the first e-mail from you in this thread (per our archives).

I personally don't find an explicit limit on cache size very attractive, because it's rather low-level and difficult to tune, and very easy to get wrong (at which point you fall from a cliff).
All the information is in backend private memory, so how would you even identify that the syscache is the thing you need to tune, or how would you determine the correct size?

> https://en.wikipedia.org/wiki/Cache_(computing)
>
> "To be cost-effective and to enable efficient use of data, caches must
> be relatively small."

Relatively small compared to what? It's also a question of how expensive cache misses are.

> Another relevant suboptimal idea would be to provide each catcache
> with a separate memory context, which is a child of
> CacheMemoryContext. This gives a slight optimization by using the slab
> context (slab.c) for a catcache with fixed-sized tuples. But that'd
> be a bit complex for PG 12, I'm afraid.

I don't know, but that does not seem very attractive. Each memory context has some overhead, and it does not solve the issue of never releasing memory to the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp>
>> I'm going to retake numbers with search-only queries.
>
> Yeah, I was stupid.
>
> I made a rerun of the benchmark using "-S -T 30" on the server built
> with no assertions and -O2. The numbers are the best of three
> successive attempts. The patched version is running with
> cache_target_memory = 0, cache_prune_min_age = 600 and
> cache_entry_limit = 0, but pruning doesn't happen with this workload.
>
> master: 13393 tps
> v12 : 12625 tps (-6%)
>
> Significant degradation is found.
>
> Reducing the frequency of dlist_move_tail by taking a 1ms interval
> between two successive updates on the same entry made the
> degradation disappear.
>
> patched : 13720 tps (+2%)
>
> I think there's still no need of such frequency. It is 100ms in
> the attached patch.
>
> # I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..

Hi,

I've done a bunch of benchmarks on v13, and I don't see any serious regression either.

Each test creates a number of tables (100, 1k, 10k, 100k and 1M) and then runs SELECT queries on them. The tables are accessed randomly - with either uniform or exponential distribution. For each combination there are 5 runs, 60 seconds each (see the attached shell scripts, it should be pretty obvious).

I've done the tests on two different machines - a small one (i5 with 8GB of RAM) and a large one (e5-2620v4 with 64GB RAM), but the behavior is almost exactly the same (with the exception of 1M tables, which do not fit into RAM on the smaller one).
On the xeon, the results (throughput compared to master) look like this:

uniform          100       1000      10000     100000    1000000
----------------------------------------------------------------
v13            105.04%   100.28%   102.96%   102.11%   101.54%
v13 (nodata)    97.05%    98.30%    97.42%    96.60%   107.55%

exponential      100       1000      10000     100000    1000000
----------------------------------------------------------------
v13            100.04%   103.48%   101.70%    98.56%   103.20%
v13 (nodata)    97.12%    98.43%    98.86%    98.48%   104.94%

The "nodata" case means the tables were empty (so no files were created), while in the other case each table contained 1 row.

Per the results it's mostly break-even, and in some cases there is actually a measurable improvement. That being said, the question is whether the patch actually reduces memory usage in a useful way - that's not something this benchmark validates. I plan to modify the tests to make the pgbench script time-dependent (i.e. to pick a subset of tables depending on time).

A couple of things I happened to notice during a quick review:

1) The sgml docs in 0002 talk about "syscache_memory_target" and "syscache_prune_min_age", but those options were renamed to just "cache_memory_target" and "cache_prune_min_age".

2) "cache_entry_limit" is not mentioned in the sgml docs at all, and it's defined three times in guc.c for some reason.

3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead of just using two bool variables prune_by_age and prune_by_number doing the same thing.

4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that pretty much mean long-running statements will set lastaccess to a very old timestamp? Also, it means that long-running statements (like a PL function accessing a bunch of tables) won't do any eviction at all, no? AFAICS we'll set the timestamp only once, at the very beginning. I wonder whether using some other timestamp source (like a timestamp updated regularly from a timer, or something like that) would be better.
5) There are two fread() calls in 0003 triggering a compiler warning about unused return value. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
> I'm not sure what you mean by "necessary" and "unnecessary" here. What
> matters is how often an entry is accessed - if it's accessed often, it makes sense
> to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
> clearly not accessed very often, so getting rid of them will not hurt the
> cache hit ratio very much.

Right, "necessary" and "unnecessary" were imprecise, and it matters how frequently the entries are accessed. What made me say "unnecessary" is the pg_statistic entry left by CREATE/DROP TEMP TABLE, which is never accessed again.

> So I agree with Robert that a time-based approach should work well here. It does
> not have the issues of setting an exact syscache size limit, it's kinda self-adaptive
> etc.
>
> In a way, this is exactly what the 5 minute rule [1] says about caching.
>
> [1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf

Then, can we just set syscache_prune_min_age to 5min? Otherwise, how can users set the expiration period?

>> The idea of expiration applies to the case where we want possibly
>> stale entries to vanish and load newer data upon the next access.
>> For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
>> Is the catcache based on the same idea with them? No.
>
> I'm not sure what this has to do with those other databases.

I meant that time-based eviction is not very good, because it could cause less frequently accessed entries to vanish even when memory is not short. Time-based eviction reminds me of Memcached, Redis, DNS, etc., which evict long-lived entries to avoid stale data, not to free space for other entries. I think size-based eviction is sufficient, like shared_buffers, the OS page cache, CPU caches, disk caches, etc.

> I'm certainly worried about the performance aspect of it. The syscache is in
> plenty of hot paths, so adding overhead may have significant impact. But that
> depends on how complex the eviction criteria will be.
The LRU chain manipulation, dlist_move_head() in SearchCatCacheInternal(), may certainly incur some overhead. If it has visible impact, then we can do the manipulation only when the user sets an upper limit on the cache size.

> And then there may be cases conflicting with the criteria, i.e. running into
> just-evicted entries much more often. This is the issue with the initially
> proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
> little bit.

In that case, the user can just enlarge the catcache.

> Not sure which mail you're referring to - this seems to be the first e-mail from
> you in this thread (per our archives).

Sorry, MauMau is me, Takayuki Tsunakawa.

> I personally don't find an explicit limit on cache size very attractive, because it's
> rather low-level and difficult to tune, and very easy to get wrong (at which
> point you fall from a cliff). All the information is in backend private memory, so
> how would you even identify syscache is the thing you need to tune, or how
> would you determine the correct size?

Just like other caches, we can present a view that shows the hits, misses, and the hit ratio of the entire catcaches. If the hit ratio is low, the user can enlarge the catcache size. That's what Oracle and MySQL do, as I referred to in this thread. The tuning parameter is the size. That's all. Besides, the v13 patch has as many as 4 parameters: cache_memory_target, cache_prune_min_age, cache_entry_limit, cache_entry_limit_prune_ratio. I don't think I can give the user good intuitive advice on how to tune these.

>> https://en.wikipedia.org/wiki/Cache_(computing)
>>
>> "To be cost-effective and to enable efficient use of data, caches must
>> be relatively small."
>
> Relatively small compared to what? It's also a question of how expensive cache
> misses are.

I guess the author meant that the cache is "relatively small" compared to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller than SSD/HDD.
In our case, we have to pay more attention to limiting the catcache memory consumption, especially because the caches are duplicated in multiple backend processes.

> I don't know, but that does not seem very attractive. Each memory context has
> some overhead, and it does not solve the issue of never releasing memory to
> the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.

I think there is little additional overhead on each catcache access -- the processing overhead is the same as when using aset, and the memory overhead is at most several dozen (the number of catcaches) MemoryContext structures. The slab context (slab.c) returns empty blocks to the OS, unlike the allocation context (aset.c).

Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Reducing the frequency of dlist_move_tail by taking a 1ms interval between two
> successive updates on the same entry made the degradation disappear.
>
> patched : 13720 tps (+2%)

What do you think contributed to this performance increase? Or do you think this is just measurement variation?

Most of my previous comments also seem to apply to v13, so let me repost them below:

(1)
+/* GUC variable to define the minimum age of entries that will be cosidered to
+ /* initilize catcache reference clock if haven't done yet */

cosidered -> considered
initilize -> initialize

I remember I saw some other wrong spellings and/or missing words, which I forgot (sorry).

(2) Only the doc prefixes "sys" to the new parameter names; other places don't have it. I think we should prefix sys, because relcache and plancache should be configurable separately due to their different usage patterns/lifecycles.

(3) The doc doesn't describe the unit of syscache_memory_target. Kilobytes?

(4)
+ hash_size = cp->cc_nbuckets * sizeof(dlist_head);
+ tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
+ tupsize = sizeof(CatCTup);

GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do.

(5)
+ if (entry_age > cache_prune_min_age)

">=" instead of ">"?

(6)
+ if (!ct->c_list || ct->c_list->refcount == 0)
+ {
+ CatCacheRemoveCTup(cp, ct);

It's better to write "ct->c_list == NULL" to follow the style in this file. "ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which should hardly ever happen.

(7) CatalogCacheCreateEntry

+ int tupsize = 0;
if (ntp)
{
int i;
+ int tupsize;

tupsize is defined twice.

(8) CatalogCacheCreateEntry

In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible.
(9) The memory for CatCList is not taken into account for syscache_memory_target. Regards Takayuki Tsunakawa
On 2/12/19 1:49 AM, Tsunakawa, Takayuki wrote:
> From: Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> I'm not sure what you mean by "necessary" and "unnecessary" here. What
>> matters is how often an entry is accessed - if it's accessed often, it makes sense
>> to keep it in the cache. Otherwise evict it. Entries not accessed for 5 minutes are
>> clearly not accessed very often, so getting rid of them will not hurt the
>> cache hit ratio very much.
>
> Right, "necessary" and "unnecessary" were imprecise, and it matters
> how frequently the entries are accessed. What made me say "unnecessary"
> is the pg_statistic entry left by CREATE/DROP TEMP TABLE which is never
> accessed again.

OK, understood.

>> So I agree with Robert that a time-based approach should work well here. It does
>> not have the issues of setting an exact syscache size limit, it's kinda self-adaptive
>> etc.
>>
>> In a way, this is exactly what the 5 minute rule [1] says about caching.
>>
>> [1] http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf
>
> Then, can we just set syscache_prune_min_age to 5min? Otherwise,
> how can users set the expiration period?

I believe so.

>>> The idea of expiration applies to the case where we want possibly
>>> stale entries to vanish and load newer data upon the next access.
>>> For example, the TTL (time-to-live) of Memcached, Redis, DNS, ARP.
>>> Is the catcache based on the same idea with them? No.
>>
>> I'm not sure what this has to do with those other databases.
>
> I meant that time-based eviction is not very good, because it
> could cause less frequently accessed entries to vanish even when memory is not
> short. Time-based eviction reminds me of Memcached, Redis, DNS, etc.
> that evict long-lived entries to avoid stale data, not to free space
> for other entries. I think size-based eviction is sufficient, like
> shared_buffers, OS page cache, CPU cache, disk cache, etc.

Right.
But the logic behind the time-based approach is that evicting such entries should not cause any issues exactly because they are accessed infrequently. It might incur some latency when we need them for the first time after the eviction, but IMHO that's acceptable (although I see Andres did not like that).

FWIW we might even evict entries after some time passes since inserting them into the cache - that's what memcached et al do, IIRC. The logic is that frequently accessed entries will get immediately loaded back (thus keeping the cache hit ratio high). But there are reasons why the other dbs do that - like not having any cache invalidation (unlike us).

That being said, having a "minimal size" threshold before starting with the time-based eviction may be a good idea.

>> I'm certainly worried about the performance aspect of it. The syscache is in
>> plenty of hot paths, so adding overhead may have significant impact. But that
>> depends on how complex the eviction criteria will be.
>
> The LRU chain manipulation, dlist_move_head() in
> SearchCatCacheInternal(), may certainly incur some overhead. If it has
> visible impact, then we can do the manipulation only when the user sets
> an upper limit on the cache size.

I think the benchmarks done so far suggest the extra overhead is within noise. So unless we manage to make it much more expensive, we should be OK I think.

>> And then there may be cases conflicting with the criteria, i.e. running into
>> just-evicted entries much more often. This is the issue with the initially
>> proposed hard limits on cache sizes, where it'd be trivial to under-size it just a
>> little bit.
>
> In that case, the user can just enlarge the catcache.

IMHO the main issues with this are

(a) It's not quite clear how to determine the appropriate limit. I can probably apply a bit of perf+gdb, but I doubt that's very nice.

(b) It's not adaptive, so systems that grow over time (e.g. by adding schemas and other objects) will keep hitting the limit over and over.

>> Not sure which mail you're referring to - this seems to be the first e-mail from
>> you in this thread (per our archives).
>
> Sorry, MauMau is me, Takayuki Tsunakawa.

Ah, of course!

>> I personally don't find an explicit limit on cache size very attractive, because it's
>> rather low-level and difficult to tune, and very easy to get wrong (at which
>> point you fall from a cliff). All the information is in backend private memory, so
>> how would you even identify syscache is the thing you need to tune, or how
>> would you determine the correct size?
>
> Just like other caches, we can present a view that shows the hits, misses,
> and the hit ratio of the entire catcaches. If the hit ratio is low, the user
> can enlarge the catcache size. That's what Oracle and MySQL do, as I referred
> to in this thread. The tuning parameter is the size. That's all.

How will that work, considering the caches are in private backend memory? And each backend may have quite different characteristics, even if they are connected to the same database?

> Besides, the v13 patch has as many as 4 parameters: cache_memory_target,
> cache_prune_min_age, cache_entry_limit, cache_entry_limit_prune_ratio.
> I don't think I can give the user good intuitive advice on how to tune these.

Isn't that more an argument for not having 4 parameters?

>>> https://en.wikipedia.org/wiki/Cache_(computing)
>>>
>>> "To be cost-effective and to enable efficient use of data, caches must
>>> be relatively small."
>>
>> Relatively small compared to what? It's also a question of how expensive cache
>> misses are.
>
> I guess the author meant that the cache is "relatively small" compared
> to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller
> than SSD/HDD. In our case, we have to pay more attention to limiting the
> catcache memory consumption, especially because the caches are duplicated
> in multiple backend processes.
I don't think so. IMHO the focus there is on "cost-effective", i.e. caches are generally more expensive than the storage, so to make them worth it you need to make them much smaller than the main storage. That's pretty much what the 5 minute rule is about, I think.

But I don't see how this applies to the problem at hand, because the system is already split into storage + cache (represented by RAM). The challenge is how to use RAM to cache various pieces of data to get the best behavior. The problem is, we don't have a unified cache, but multiple smaller ones (shared buffers, page cache, syscache) competing for the same resource. Of course, having multiple (different) copies of the syscache makes it even more difficult.

(Does this make sense, or am I just babbling nonsense?)

>> I don't know, but that does not seem very attractive. Each memory context has
>> some overhead, and it does not solve the issue of never releasing memory to
>> the OS. So we'd still have to rebuild the contexts at some point, I'm afraid.
>
> I think there is little additional overhead on each catcache access
> -- the processing overhead is the same as when using aset, and the memory
> overhead is at most several dozen (the number of catcaches)
> MemoryContext structures.

Hmmm. That doesn't seem particularly terrible, I guess.

> The slab context (slab.c) returns empty blocks to the OS, unlike the
> allocation context (aset.c).

Slab can do that, but it requires a certain allocation pattern, and I very much doubt the syscache has it. It'll be trivial to end up with one active entry on each block (which means slab can't release it).

BTW doesn't the syscache store the full on-disk tuple? That doesn't seem like a fixed-length entry, which is a requirement for slab. No?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com] > > I meant that the time-based eviction is not very good, because it > > could cause less frequently used entries to vanish even when memory is not > > short. Time-based eviction reminds me of Memcached, Redis, DNS, etc. > > that evict long-lived entries to avoid stale data, not to free space > > for other entries. I think size-based eviction is sufficient, like > > shared_buffers, OS page cache, CPU cache, disk cache, etc. > > > > Right. But the logic behind the time-based approach is that evicting such > entries should not cause any issues exactly because they are accessed > infrequently. It might incur some latency when we need them for the > first time after the eviction, but IMHO that's acceptable (although I > see Andres did not like that). Yes, that's what I expressed. That is, I'm probably with Andres. > FWIW we might even evict entries after some time passes since inserting > them into the cache - that's what memcached et al do, IIRC. The logic is > that frequently accessed entries will get immediately loaded back (thus > keeping cache hit ratio high). But there are reasons why the other dbs > do that - like not having any cache invalidation (unlike us). These are what Memcached and Redis do: 1. Evict entries that have lived longer than their TTLs. This is independent of the cache size. This is to avoid keeping stale data in the cache when the underlying data (such as in the database) is modified. This doesn't apply to PostgreSQL. 2. Evict the least recently accessed entries. This is to make room for new entries when the cache is full. This is similar to or the same as what PostgreSQL and other DBMSs do for their database cache. Oracle and MySQL also do this for their dictionary caches, where "dictionary cache" corresponds to syscache in PostgreSQL. Here's my sketch for this feature.
Although it may not meet all (contradictory) requirements as you said, it's simple and familiar for those who have used PostgreSQL and other DBMSs. What do you think? The points are simplicity, familiarity, and memory consumption control for the DBA. * Add a GUC parameter syscache_size which imposes the upper limit on the total size of all catcaches, not on individual catcaches. The naming follows effective_cache_size. It can be syscache_mem to follow work_mem and maintenance_work_mem. The default value is 0, which doesn't limit the cache size, as now. * A new member variable in CatCacheHeader tracks the total size of all cached entries. * A single new LRU list in CatCacheHeader links all cache tuples in LRU order. Each cache access, SearchCatCacheInternal(), puts the found entry at its front. * Insertion of a new catcache entry adds the entry size to the total cache size. If the total size exceeds the limit defined by syscache_size, the least recently accessed entries are removed until the total cache size gets below the limit. This eviction results in slight overhead when the cache is full, but the response time is steady. On the other hand, with the proposed approach, users will wonder about mysterious long response times due to bulk entry deletions. > > In that case, the user can just enlarge the catcache. > > > > IMHO the main issues with this are > > (a) It's not quite clear how to determine the appropriate limit. I can > probably apply a bit of perf+gdb, but I doubt that's very nice. Like Oracle and MySQL, the user should be able to see the cache hit ratio with a statistics view. > (b) It's not adaptive, so systems that grow over time (e.g. by adding > schemas and other objects) will keep hitting the limit over and over. The user needs to restart the database instance to enlarge the syscache. That's also true for shared buffers: to accommodate a growing amount of data, the user needs to increase shared_buffers and restart the server.
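The sketch above (a single LRU list, a running total of entry sizes, and eviction from the cold end on insert until the total is back under the cap) can be condensed into a minimal, self-contained model. The names and the circular-list layout here are illustrative assumptions, not the actual catcache code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy size-capped LRU cache, following the proposal sketched above. */
typedef struct LruEntry
{
    struct LruEntry *prev, *next;
    size_t size;
} LruEntry;

typedef struct LruCache
{
    LruEntry head;              /* head.next = hottest, head.prev = coldest */
    size_t total;               /* sum of entry sizes, as in CatCacheHeader */
    size_t limit;               /* 0 means "no limit", like the proposed GUC */
} LruCache;

static void
lru_init(LruCache *c, size_t limit)
{
    c->head.next = c->head.prev = &c->head;
    c->total = 0;
    c->limit = limit;
}

static void
lru_unlink(LruEntry *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void
lru_push_front(LruCache *c, LruEntry *e)
{
    e->next = c->head.next;
    e->prev = &c->head;
    e->next->prev = e;
    c->head.next = e;
}

/* On each cache hit, move the entry to the hot end of the list. */
static void
lru_touch(LruCache *c, LruEntry *e)
{
    lru_unlink(e);
    lru_push_front(c, e);
}

/* Insert a new entry, then evict cold entries while over the limit. */
static void
lru_insert(LruCache *c, LruEntry *e, size_t size)
{
    e->size = size;
    lru_push_front(c, e);
    c->total += size;
    while (c->limit > 0 && c->total > c->limit)
    {
        LruEntry *victim = c->head.prev;    /* least recently used */

        if (victim == e)
            break;              /* never evict the entry just inserted */
        lru_unlink(victim);
        c->total -= victim->size;
        free(victim);
    }
}
```

This keeps the per-access cost at a couple of pointer updates while making eviction incremental, which is the "steady response time" property the sketch argues for.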
But the current syscache is in local memory, so the server may not need a restart. > > Just like other caches, we can present a view that shows the hits, misses, > and the hit ratio of the entire catcaches. If the hit ratio is low, the > user can enlarge the catcache size. That's what Oracle and MySQL do as > I referred to in this thread. The tuning parameter is the size. That's > all. > > How will that work, considering the caches are in private backend > memory? And each backend may have quite different characteristics, even > if they are connected to the same database? Assuming that pg_stat_syscache (pid, cache_name, hits, misses) gives the statistics, the statistics data can be stored in shared memory, because the number of backends and the number of catcaches are fixed. > > I guess the author meant that the cache is "relatively small" compared > to the underlying storage: CPU cache is smaller than DRAM, DRAM is smaller > than SSD/HDD. In our case, we have to pay more attention to limit the > catcache memory consumption, especially because they are duplicated in > multiple backend processes. > > > > I don't think so. IMHO the focus there is on "cost-effective", i.e. > caches are generally more expensive than the storage, so to make them > worth it you need to make them much smaller than the main storage. I think we're saying the same thing. Perhaps my English is not good enough. > But I don't see how this applies to the problem at hand, because the > system is already split into storage + cache (represented by RAM). The > challenge is how to use RAM to cache various pieces of data to get the > best behavior. The problem is, we don't have a unified cache, but > multiple smaller ones (shared buffers, page cache, syscache) competing > for the same resource. You're right. On the other hand, we can consider syscache, shared buffers, and page cache as different tiers of storage, even though they are all on DRAM.
syscache caches some data from shared buffers for efficient access. If we use much memory for syscache, there's less memory for caching user data in shared buffers and the page cache. That's a normal tradeoff of caches. > Slab can do that, but it requires a certain allocation pattern, and I very > much doubt syscache has it. It'll be trivial to end up with one active > entry on each block (which means slab can't release it). I expect so, too, although the slab context makes efforts to mitigate that possibility like this: * This also allows various optimizations - for example when searching for * free chunk, the allocator reuses space from the fullest blocks first, in * the hope that some of the less full blocks will get completely empty (and * returned back to the OS). > BTW doesn't syscache store the full on-disk tuple? That doesn't seem > like a fixed-length entry, which is a requirement for slab. No? Some system catalogs are fixed in size like pg_am and pg_amop. But I guess the number of such catalogs is small. Dominant catalogs like pg_class and pg_attribute have variable-size rows. So using different memory contexts for the limited set of fixed-size catalogs might not show any visible performance improvement or memory reduction. Regards Takayuki Tsunakawa
At Fri, 8 Feb 2019 09:42:20 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F41EDD1@G01JPEXMBKW04> > >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > >I made a rerun of the benchmark using "-S -T 30" on a server build with no assertions and > >-O2. The numbers are the best of three successive attempts. The patched version is > >running with cache_target_memory = 0, cache_prune_min_age = 600 and > >cache_entry_limit = 0, but pruning doesn't happen with this workload. > > > >master: 13393 tps > >v12 : 12625 tps (-6%) > > > >Significant degradation is found. > > > >Reducing the frequency of dlist_move_tail by taking a 1ms interval between two successive > >updates of the same entry made the degradation disappear. > > > >patched : 13720 tps (+2%) > > It would be good to introduce some interval. > I followed your benchmark (initialized with scale factor=10, other options the same) > and found the same tendency. > These are averages of 5 trials. > master: 7640.000538 > patch_v12: 7417.981378 (3% down against master) > patch_v13: 7645.071787 (almost the same as master) Thank you for cross checking. > These are workloads where no pruning happens, as you mentioned. > I'd like to do a benchmark of the cache-pruning case as well. > To demonstrate the cache-pruning case, > right now I'm making hundreds of partitioned tables and running a select query on each partition > using a pgbench custom file. Maybe using a small cache_prune_min_age or the hard limit would be better. > Are there any good models? As per Tomas' benchmark, it doesn't seem to hurt in that case. > ># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense.. > How about MIN_LRU_UPDATE_INTERVAL? Looks fine. Fixed in the next version. Thank you for the suggestion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
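The MIN_LRU_UPDATE_INTERVAL idea being named here is simple rate-limiting: skip the dlist_move_tail() relink when the entry was promoted in the LRU list recently enough. A minimal sketch of that decision follows (the names are illustrative; the interval constant is too, since the thread discusses 1ms while the v14 patch settles on 100ms):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Skip LRU relinks if the entry was promoted this recently (microseconds). */
#define MIN_LRU_UPDATE_INTERVAL 1000

typedef struct Entry
{
    int64_t lastaccess;         /* timestamp of the last LRU promotion */
} Entry;

/*
 * Called on every cache hit.  Returns true when the access actually
 * relinked the entry (i.e. when the real code would call dlist_move_tail).
 */
static bool
touch_entry(Entry *e, int64_t now)
{
    if (now - e->lastaccess <= MIN_LRU_UPDATE_INTERVAL)
        return false;           /* too soon: skip the relink */
    e->lastaccess = now;
    return true;                /* would relink to the hot end here */
}
```

Hot entries hit many times per millisecond pay for at most one relink per interval, which is why the dlist_move_tail overhead disappears from the benchmark while the LRU ordering stays approximately correct.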
Thank you for testing and the comments, Tomas. At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com> > On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote: > > At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp> > I've done a bunch of benchmarks on v13, and I don't see any serious > regression either. Each test creates a number of tables (100, 1k, 10k, > 100k and 1M) and then runs SELECT queries on them. The tables are > accessed randomly - with either uniform or exponential distribution. For > each combination there are 5 runs, 60 seconds each (see the attached > shell scripts, it should be pretty obvious). > > I've done the tests on two different machines - small one (i5 with 8GB > of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is > almost exactly the same (with the exception of 1M tables, which does not > fit into RAM on the smaller one). > > On the xeon, the results (throughput compared to master) look like this: > > > uniform 100 1000 10000 100000 1000000 > ------------------------------------------------------------ > v13 105.04% 100.28% 102.96% 102.11% 101.54% > v13 (nodata) 97.05% 98.30% 97.42% 96.60% 107.55% > > > exponential 100 1000 10000 100000 1000000 > ------------------------------------------------------------ > v13 100.04% 103.48% 101.70% 98.56% 103.20% > v13 (nodata) 97.12% 98.43% 98.86% 98.48% 104.94% > > The "nodata" case means the tables were empty (so no files created), > while in the other case each table contained 1 row. > > Per the results it's mostly break even, and in some cases there is > actually a measurable improvement. Great! I guess it comes from the reduced hash size?
> That being said, the question is whether the patch actually reduces > memory usage in a useful way - that's not something this benchmark > validates. I plan to modify the tests to make the pgbench script > time-dependent (i.e. to pick a subset of tables depending on time). Thank you. > A couple of things I've happened to notice during a quick review: > > 1) The sgml docs in 0002 talk about "syscache_memory_target" and > "syscache_prune_min_age", but those options were renamed to just > "cache_memory_target" and "cache_prune_min_age". I'm at a loss what to call syscache for users. I think it is "catalog cache". The most basic component is called catcache, which is covered by the syscache layer; both of them are not revealed to users, and it is shown to users as "catalog cache". Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if it exists) "catalog_cache_entry_limit" and "catalog_cache_prune_ratio" make sense? > 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's > defined three times in guc.c for some reason. It is just a PoC, added to show how it looks. (The multiple instances must be a result of a convulsion of my fingers..) I think this is not useful unless it can be specified on a per-relation or per-cache basis. I'll remove the GUC and add reloptions for the purpose. (But it won't work for pg_class and pg_attribute for now.) > 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead > of just using two bool variables prune_by_age and prune_by_number doing > the same thing. Agreed. It's a kind of memory stinginess, which is useless there. > 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that > pretty much mean long-running statements will set the lastaccess to very > old timestamp? Also, it means that long-running statements (like a PL > function accessing a bunch of tables) won't do any eviction at all, no? > AFAICS we'll set the timestamp only once, at the very beginning.
> > I wonder whether using some other timestamp source (like a timestamp > updated regularly from a timer, or something like that) would work. I didn't consider planning that happens within a function. If 5min is the default for catalog_cache_prune_min_age, 10% of it (30s) seems enough, and gettimeofday() at such intervals wouldn't affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather than the fixed value 30s, with 1s as the minimum. I observed significant degradation by setting up a timer at every statement start. The patch does the following to get rid of the degradation. (1) Every statement updates the catcache timestamp as currently done. (SetCatCacheClock) (2) The timestamp is also updated periodically using a timer, separately from (1). The timer is started at the time of (1) if not yet running. (SetCatCacheClock, UpdateCatCacheClock) (3) Statement end and transaction end don't stop the timer, to avoid the overhead of setting up a timer. (4) But it stops on error. I chose not to change the behavior in PostgresMain that kills all timers on error. (5) Also, changing the GUC catalog_cache_prune_min_age kills the timer, in order to reflect the change quickly, especially when it is shortened. > 5) There are two fread() calls in 0003 triggering a compiler warning > about unused return value. Ugg. It's in PoC style... (But my compiler didn't complain about it.) Maybe fixed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
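The interval rule described above (one tenth of catalog_cache_prune_min_age, floored at one second) is small enough to sketch directly. This mirrors the arithmetic in the patch's SetupCatCacheClockTimer, with the units the mail describes (the GUC in seconds, the timeout in milliseconds); the function name here is illustrative:

```c
#include <assert.h>

/*
 * Compute the catcache clock refresh interval: one tenth of
 * catalog_cache_prune_min_age (a GUC in seconds), but never more often
 * than once per second.  Returns milliseconds, as enable_timeout_after
 * expects in the patch.
 */
static long
catcache_clock_delay_ms(int prune_min_age_sec)
{
    long delay = (long) prune_min_age_sec * 1000 / 10;  /* one tenth, in ms */

    if (delay < 1000)
        delay = 1000;           /* lower limit: 1 second */
    return delay;
}
```

With the default of 300 seconds this refreshes the clock every 30 seconds, coarse enough to keep gettimeofday() off the hot path while still letting long-running statements see an advancing clock.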
At Tue, 12 Feb 2019 01:02:39 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB972A6@G01JPEXMBYT05> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > > Reduced frequency of dlist_move_tail by taking a 1ms interval between two > > successive updates on the same entry made the degradation disappear. > > > > patched : 13720 tps (+2%) > > What do you think contributed to this performance increase? Or do you think this is just measurement variation? > > Most of my previous comments also seem to apply to v13, so let me repost them below: > > (1) > +/* GUC variable to define the minimum age of entries that will be cosidered to > + /* initilize catcache reference clock if haven't done yet */ > > cosidered -> considered > initilize -> initialize Fixed. I found "databsae", "temprary", "resturns", "If'force'" (missing space), "aginst", and "maintan", and fixed them all. > I remember I saw some other wrong spellings and/or missing words, which I forgot (sorry). Thank you for pointing out some of them. > (2) > Only the doc prefixes "sys" to the new parameter names. Other places don't have it. I think we should prefix sys, because relcache and plancache should be configurable separately because of their different usage patterns/lifecycles. I tend to agree. They are already removed in this patchset. The names are changed to "catalog_cache_*" in the new version. > (3) > The doc doesn't describe the unit of syscache_memory_target. Kilobytes? Added. > (4) > + hash_size = cp->cc_nbuckets * sizeof(dlist_head); > + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; > + tupsize = sizeof(CatCTup); > > GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/ do. Thanks. Done. It includes the bucket and cache header parts but still excludes clist. Renamed from tupsize to memusage. > (5) > + if (entry_age > cache_prune_min_age) > > ">=" instead of ">"?
I didn't take it seriously, but it is better. Fixed. > (6) > + if (!ct->c_list || ct->c_list->refcount == 0) > + { > + CatCacheRemoveCTup(cp, ct); > > It's better to write "ct->c_list == NULL" to follow the style in this file. > > "ct->refcount == 0" should also be checked prior to removing the catcache tuple, just in case the tuple hasn't been released for a long time, which might hardly happen. Yeah, I fixed it in v12. This no longer removes an entry in use. (if (c_list) is used in the file.) > (7) > CatalogCacheCreateEntry > > + int tupsize = 0; > if (ntp) > { > int i; > + int tupsize; > > tupsize is defined twice. The second tupsize was bogus, but the first is removed in this version. Now the memory usage of an entry is calculated as its chunk size. > (8) > CatalogCacheCreateEntry > > In the negative entry case, the memory allocated by CatCacheCopyKeys() is not counted. I'm afraid that's not negligible. Right. Fixed. > (9) > The memory for CatCList is not taken into account for syscache_memory_target. Yeah, this is intentional since CatCacheList is short-lived. Comment added. | * Don't waste a time by counting the list in catcache memory usage, | * since a list doesn't persist for a long time | */ | cl = (CatCList *) | palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *)); Please find the attached, which is the new version v14 addressing Tomas', Ideriha-san's and your comments. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/3] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it.
--- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list. + */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From 5031833af1777c4c6a6bf8daf32b6a3f428ccd79 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 2/3] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. This also can put a hard limit on the number of catcache entries. 
--- doc/src/sgml/config.sgml | 41 ++++ src/backend/tcop/postgres.c | 13 ++ src/backend/utils/cache/catcache.c | 285 +++++++++++++++++++++++++- src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 + src/backend/utils/misc/guc.c | 43 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/miscadmin.h | 1 + src/include/utils/catcache.h | 49 ++++- src/include/utils/timeout.h | 1 + 10 files changed, 440 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 07b847a8e9..71d784b6fe 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1661,6 +1661,47 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + + Specifies the minimum amount of unused time in seconds at which a + catalog cache entry is considered to be removed. -1 indicates that + this feature is disabled at all. The value defaults to 300 seconds + (<literal>5 minutes</literal>). The catalog cache entries that are + not used for the duration can be removed to prevent bloat. This + behavior is suppressed until the size of a catalog cache exceeds + <xref linkend="guc-catalog-cache-memory-target"/>. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target"> + <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>syscache_memory_target</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which syscache is expanded + without pruning in kilobytes. 
The value defaults to 0, indicating that + pruning is always considered. After exceeding this size, catalog cache + pruning is considered according to + <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep + certain amount of catalog cache entries with intermittent usage, try + increase this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..f192ee2ca6 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2584,6 +2585,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void @@ -3159,6 +3161,14 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + if (CatcacheClockTimeoutPending) + { + CatcacheClockTimeoutPending = 0; + + /* Update timetamp then set up the next timeout */ + UpdateCatCacheClock(); + } } @@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[], QueryCancelPending = false; /* second to avoid race condition */ stmt_timeout_active = false; + /* get sync with the timer state */ + catcache_clock_timeout_active = false; + /* Not reading from the client anymore. 
*/ DoingCommandRead = false; diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..0195e19976 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -39,6 +39,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -71,9 +72,43 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* GUC variable to define the minimum age of entries that will be considered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int catalog_cache_prune_min_age = 300; + +/* + * GUC variable to define the minimum size of hash to cosider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int catalog_cache_memory_target = 0; + +/* + * GUC for limit by the number of entries. Entries are removed when the number + * of them goes above catalog_cache_entry_limit and leaving newer entries by + * the ratio specified by catalog_cache_prune_ratio. + */ +int catalog_cache_entry_limit = 0; +double catalog_cache_prune_ratio = 0.8; + +/* + * Flag to keep track of whether catcache clock timer is active. + */ +bool catcache_clock_timeout_active = false; + +/* + * Minimum interval between two success move of a cache entry in LRU list, + * in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock used to record the last accessed time of a catcache record. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -481,6 +516,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -490,6 +526,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_memusage -= ct->size; pfree(ct); --cache->cc_ntup; @@ -841,7 +878,13 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_memusage = + CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext, + cp) + + CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext, + cp->cc_bucket); + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -858,9 +901,191 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * helper routine for SetCatCacheClock and UpdateCatCacheClockTimer. + * + * We need to maintain the catcache clock during a long query. + */ +void +SetupCatCacheClockTimer(void) +{ + long delay; + + /* stop timer if not needed */ + if (catalog_cache_prune_min_age == 0) + { + catcache_clock_timeout_active = false; + return; + } + + /* One 10th of the variable, in milliseconds */ + delay = catalog_cache_prune_min_age * 1000/10; + + /* Lower limit is 1 second */ + if (delay < 1000) + delay = 1000; + + enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay); + + catcache_clock_timeout_active = true; +} + +/* + * Update catcacheclock: this is intended to be called from + * CATCACHE_CLOCK_TIMEOUT. 
The interval is expected more than 1 second (see + * above), so GetCurrentTime() doesn't harm. + */ +void +UpdateCatCacheClock(void) +{ + catcacheclock = GetCurrentTimestamp(); + SetupCatCacheClockTimer(); +} + +/* + * It may take an unexpectedly long time before the next clock update when + * catalog_cache_prune_min_age gets shorter. Disabling the current timer let + * the next update happen at the expected interval. We don't necessariry + * require this for increase the age but we don't need to avoid to disable + * either. + */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (catcache_clock_timeout_active) + disable_timeout(CATCACHE_CLOCK_TIMEOUT, false); + + catcache_clock_timeout_active = false; +} + +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. The eviction is performed with the similar algorithm with buffer + * eviction using access counter. Entries that are accessed several times can + * live longer than those that have had less access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + size_t hash_size; + int nelems_before = cp->cc_ntup; + int ndelelems = 0; + bool prune_by_age = false; + bool prune_by_number = false; + dlist_mutable_iter iter; + + if (catalog_cache_prune_min_age >= 0) + { + /* prune only if the size of the hash is above the target */ + + hash_size = cp->cc_nbuckets * sizeof(dlist_head); + if (hash_size + cp->cc_memusage > + (Size) catalog_cache_memory_target * 1024L) + prune_by_age = true; + } + + if (catalog_cache_entry_limit > 0 && + nelems_before >= catalog_cache_entry_limit) + { + ndelelems = nelems_before - + (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio); + + /* an arbitrary lower limit.. 
*/ + if (ndelelems < 256) + ndelelems = 256; + if (ndelelems > nelems_before) + ndelelems = nelems_before; + + prune_by_number = true; + } + + /* Return immediately if no pruning is wanted */ + if (!prune_by_age && !prune_by_number) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + bool remove_this = false; + + /* We don't remove referenced entry */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* check against age */ + if (prune_by_age) + { + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* no longer have a business with further entries, exit */ + prune_by_age = false; + break; + } + /* + * Entries that are not accessed after last pruning are removed in + * that seconds, and that has been accessed several times are + * removed after leaving alone for up to three times of the + * duration. We don't try shrink buckets since pruning effectively + * caps catcache expansion in the long term. 
+ */ + if (ct->naccess > 0) + ct->naccess--; + else + remove_this = true; + } + + /* check against entry number */ + if (prune_by_number) + { + if (nremoved < ndelelems) + remove_this = true; + else + prune_by_number = false; /* we're satisfied */ + } + + /* exit immediately if all finished */ + if (!prune_by_age && !prune_by_number) + break; + + /* do the work */ + if (remove_this) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, nelems_before); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -878,6 +1103,13 @@ RehashCatCache(CatCache *cp) newnbuckets = cp->cc_nbuckets * 2; newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); + /* recalculate memory usage from the first */ + cp->cc_memusage = + CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext, + cp) + + CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext, + newbucket); + /* Move all entries from old hash table to new. */ for (i = 0; i < cp->cc_nbuckets; i++) { @@ -890,6 +1122,7 @@ RehashCatCache(CatCache *cp) dlist_delete(iter.cur); dlist_push_head(&newbucket[hashIndex], &ct->cache_elem); + cp->cc_memusage += ct->size; } } @@ -1274,6 +1507,21 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * We don't want too frequent update of + * LRU. catalog_cache_prune_min_age can be changed on-session so we + * need to maintain the LRU regardless of catalog_cache_prune_min_age. + */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. 
If it's * negative, we can report failure to the caller. @@ -1709,6 +1957,11 @@ SearchCatCacheList(CatCache *cache, /* Now we can build the CatCList entry. */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); nmembers = list_length(ctlist); + + /* + * Don't waste a time by counting the list in catcache memory usage, + * since it doesn't live a long life. + */ cl = (CatCList *) palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *)); @@ -1824,6 +2077,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1842,8 +2096,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; @@ -1877,7 +2131,6 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); ct = (CatCTup *) palloc(sizeof(CatCTup)); - /* * Store keys - they'll point into separately allocated memory if not * by-value. 
@@ -1898,18 +2151,38 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + ct->size = + CacheMemoryContext->methods->get_chunk_space(CacheMemoryContext, + ct); + cache->cc_memusage += ct->size; + + /* increase refcount so that this survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + /* we may still want to prune by entry number, check it */ + else if (catalog_cache_entry_limit > 0 && + cache->cc_ntup > catalog_cache_entry_limit) + CatCacheCleanupOldEntries(cache); + + ct->refcount--; return ct; } diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..0e8b972a29 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t CatcacheClockTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git
a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a5ee209f91..9eb50e9676 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void CatcacheClockTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, + CatcacheClockTimeoutHandler); } /* @@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +CatcacheClockTimeoutHandler(void) +{ + CatcacheClockTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. 
*/ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 41d477165c..c62d5ad8b8 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2205,6 +2206,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Catalog cache entries that remain unused longer than this many seconds become candidates for removal."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + + { + {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Time-based cache pruning starts working after exceeding this size."), + GUC_UNIT_KB + }, + &catalog_cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of catalog cache entries."), + NULL + }, + &catalog_cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if @@ -3368,6 +3401,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the reduction ratio of pruning caused by catalog_cache_entry_limit."), + NULL + }, + &catalog_cache_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index ad6c436f93..aeb5968e75 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_memory_target = 0kB # in kB +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..33b800e80f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..0425fc0786 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include
"datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,10 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; + int cc_memusage; /* memory usage of this catcache (excluding + * header part) */ + int cc_nfreeent; /* # of entries currently not referenced */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +124,10 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +197,45 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; +extern int catalog_cache_memory_target; +extern int catalog_cache_entry_limit; +extern double catalog_cache_prune_ratio; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * Flag to keep track of whether catcache timestamp timer is active. + */ +extern bool catcache_clock_timeout_active; + +/* catcache prune time helper functions */ +extern void SetupCatCacheClockTimer(void); +extern void UpdateCatCacheClock(void); + +/* + * SetCatCacheClock - set timestamp for catcache access record and start + * maintenance timer if needed. We keep updating the clock even while pruning + * is disabled so that we are not confused by a bogus clock value.
+ */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; + + if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0) + SetupCatCacheClockTimer(); +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..b2d97b4f7b 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + CATCACHE_CLOCK_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3 From ea9d43f623d093bc1276fd1d5480e5cff6097d60 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 12 Feb 2019 20:31:16 +0900 Subject: [PATCH 3/3] Syscache usage tracking feature Collects syscache usage statistics and shows them in the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval.
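For reviewers, the statistics file this patch writes uses a simple tagged-record format: one 'T' byte per cache record, an 'E' terminator, and anything else treated as corruption (see pgstat_write_syscache_stats / pgstat_get_syscache_stats below). A standalone model of just that framing, with an illustrative record struct rather than the real SysCacheStats:

```c
/* Standalone model of the stats file framing: each record is tagged
 * 'T', the stream ends with 'E', and the reader treats any other
 * framing as corruption. ModelStats is illustrative only. */
#include <assert.h>
#include <stdio.h>

typedef struct ModelStats
{
	int		cacheid;
	long	nsearches;
} ModelStats;

/* Write nstats records followed by the end marker; 0 on success. */
static int
model_write_stats(FILE *fp, const ModelStats *stats, int nstats)
{
	for (int i = 0; i < nstats; i++)
	{
		fputc('T', fp);
		if (fwrite(&stats[i], sizeof(ModelStats), 1, fp) != 1)
			return -1;
	}
	fputc('E', fp);
	return ferror(fp) ? -1 : 0;
}

/* Read records until 'E'; returns the record count, or -1 if the
 * stream is truncated or mis-framed. */
static int
model_read_stats(FILE *fp, ModelStats *out, int maxstats)
{
	int		n = 0;
	int		c;

	while ((c = fgetc(fp)) == 'T')
	{
		if (n >= maxstats ||
			fread(&out[n], sizeof(ModelStats), 1, fp) != 1)
			return -1;
		n++;
	}
	return (c == 'E' && fgetc(fp) == EOF) ? n : -1;
}
```

The explicit terminator is what lets the reader distinguish a cleanly written file from one cut short mid-write, which matches the patch's "abandon the result if file is broken" behavior.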
--- doc/src/sgml/config.sgml | 16 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 93 +++++++++--- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 564 insertions(+), 36 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 71d784b6fe..2eceec1d94 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6703,6 +6703,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval"> + <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_catalog_cache_usage_interval</varname> + configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which catalog cache usage + statistics are collected in the session. This parameter is 0 by default, + which means disabled. Only superusers can change this setting.
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..8c4ab0aef9 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; 
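The interval gate implemented by pgstat_write_syscache_stats() further down, which either performs the write or reports how long the caller should wait before retrying, can be sketched independently. This is not patch code; it uses plain millisecond integers instead of TimestampTz, and the names are made up for illustration.

```c
/* Standalone model of the write-throttling decision: if the configured
 * interval has not elapsed, return the remaining time in milliseconds
 * so the caller can arm a timeout; 0 means the write happened (or
 * tracking is disabled). Simplified from the TimestampTz-based code. */
#include <assert.h>

static long
model_stats_write_due(long long *last_report, long long now_ms,
					  int interval_ms, int force)
{
	long long	elapsed = now_ms - *last_report;

	if (!force && interval_ms <= 0)
		return 0;				/* tracking disabled, nothing to do */

	if (!force && elapsed < interval_ms)
		return (long) (interval_ms - elapsed);	/* try again later */

	*last_report = now_ms;		/* "write" and restart the interval */
	return 0;
}
```

Returning the remaining time rather than a boolean is what lets the idle loop in postgres.c arm IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT for exactly the right duration.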
/* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify target file types to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts the files to remove. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after. 
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled. remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not yet the time, inform the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Inhibit recursive + * interrupts to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell caller to wait for the next interval. 
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f192ee2ca6..d0afee189f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3159,6 +3159,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); @@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_syscache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_syscache_update_timeout = true; + enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_syscache_update_timeout) + { + disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false); + disable_idle_syscache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..a314f431c6 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + if (fread(&cacheid, sizeof(int), 1, fpin) != 1 || + fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 || + fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 0195e19976..fd84e35a6a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -109,6 +109,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock used to record the last accessed time of a catcache record. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -640,9 +644,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -718,9 +720,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -1032,10 +1032,10 @@ CatCacheCleanupOldEntries(CatCache *cp) int us; /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the last access to the "current" + * time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result.
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); @@ -1463,9 +1463,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1535,9 +1533,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1546,9 +1542,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1676,9 +1670,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1789,9 +1781,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1848,9 +1838,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2373,3 +2361,68 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). 
+ */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_memusage; + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* + * catalog_cache_prune_min_age can be changed within a session, so fill + * it in every time + */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = + (int) (catalog_cache_prune_min_age * ageclass[i]); + + /* + * The nth element of nclass_entries stores the number of cache entries + * that have lived unaccessed for the corresponding multiple in ageclass + * of catalog_cache_prune_min_age. + */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the last access to the "current" + * time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned.
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. + */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 0e8b972a29..b7c647b5e0 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; volatile sig_atomic_t CatcacheClockTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 9eb50e9676..2f3251e8d5 100644 --- a/src/backend/utils/init/postinit.c +++ 
b/src/backend/utils/init/postinit.c @@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); static void CatcacheClockTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, IdleInTransactionSessionTimeoutHandler); RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, CatcacheClockTimeoutHandler); + RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1249,6 +1252,14 @@ CatcacheClockTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c62d5ad8b8..7f1670fa5b 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3178,6 +3178,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. 
Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index aeb5968e75..797f52fa2a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -556,6 +556,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_catalog_cache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 24f99f7fc4..fc35b6be47 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9689,6 +9689,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 33b800e80f..767c94a63c 100644 --- 
a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 0425fc0786..8e477090e2 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -68,10 +68,8 @@ typedef struct catcache int cc_nfreeent; /* # of entries 
currently not referenced */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -84,7 +82,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -275,4 +272,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples residing in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples that fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void 
SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index b2d97b4f7b..0677978923 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -32,6 +32,7 @@ typedef enum TimeoutId STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, CATCACHE_CLOCK_TIMEOUT, + IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); 
pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3
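The age-class binning done by the CatCacheGetStats() hunk at the top of this patch can be modeled in isolation. The sketch below mirrors that while-loop; the multiplier values in ageclass[] here are an assumption for illustration (the real array is defined elsewhere in the patch), with the trailing 0.0 acting as a catch-all sentinel for the oldest class.

```c
#include <assert.h>

#define NAGECLASSES 6			/* SYSCACHE_STATS_NAGECLASSES in the patch */

/*
 * Hypothetical multipliers of catalog_cache_prune_min_age; the real
 * ageclass[] lives in catcache.c and ends with a 0.0 sentinel.
 */
static const double ageclass[NAGECLASSES] = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0};

/* Compute per-class boundaries in seconds, as CatCacheGetStats() does. */
static void
fill_boundaries(int prune_min_age, int *boundaries)
{
	int			i;

	for (i = 0; i < NAGECLASSES; i++)
		boundaries[i] = (int) (prune_min_age * ageclass[i]);
}

/*
 * Mirror of the binning loop: advance while the entry's age exceeds the
 * class boundary; the last class catches everything older.
 */
static int
age_class_index(long entry_age, const int *boundaries)
{
	int			j = 0;

	while (j < NAGECLASSES - 1 && entry_age > boundaries[j])
		j++;
	return j;
}
```

With the proposed default catalog_cache_prune_min_age of 300 s this yields boundaries {15, 30, 300, 600, 900, 0}, so an entry idle for 700 s lands in class 4 and anything idle past 900 s falls into the final class.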
On 2/12/19 12:35 PM, Kyotaro HORIGUCHI wrote: > Thank you for testing and the commits, Tomas. > > At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <74386116-0bc5-84f2-e614-0cff19aca2de@2ndquadrant.com> >> On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote: >>> At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in<20190207.152418.139132570.horiguchi.kyotaro@lab.ntt.co.jp> >> I've done a bunch of benchmarks on v13, and I don't see any serious >> regression either. Each test creates a number of tables (100, 1k, 10k, >> 100k and 1M) and then runs SELECT queries on them. The tables are >> accessed randomly - with either uniform or exponential distribution. For >> each combination there are 5 runs, 60 seconds each (see the attached >> shell scripts, it should be pretty obvious). >> >> I've done the tests on two different machines - small one (i5 with 8GB >> of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is >> almost exactly the same (with the exception of 1M tables, which does not >> fit into RAM on the smaller one). >> >> On the xeon, the results (throughput compared to master) look like this: >> >> >> uniform 100 1000 10000 100000 1000000 >> ------------------------------------------------------------ >> v13 105.04% 100.28% 102.96% 102.11% 101.54% >> v13 (nodata) 97.05% 98.30% 97.42% 96.60% 107.55% >> >> >> exponential 100 1000 10000 100000 1000000 >> ------------------------------------------------------------ >> v13 100.04% 103.48% 101.70% 98.56% 103.20% >> v13 (nodata) 97.12% 98.43% 98.86% 98.48% 104.94% >> >> The "nodata" case means the tables were empty (so no files created), >> while in the other case each table contained 1 row. >> >> Per the results it's mostly break even, and in some cases there is >> actually a measurable improvement. > > Great! I guess it comes from reduced size of hash? > Not sure about that. 
I haven't actually verified that it reduces the cache size at all - I was measuring the overhead of the extra work. And I don't think the syscache actually shrank significantly, because the throughput was quite high (~15-30k tps, IIRC) so pretty much everything was touched within the default 600 seconds. >> That being said, the question is whether the patch actually reduces >> memory usage in a useful way - that's not something this benchmark >> validates. I plan to modify the tests to make pgbench script >> time-dependent (i.e. to pick a subset of tables depending on time). > > Thank you. > >> A couple of things I've happened to notice during a quick review: >> >> 1) The sgml docs in 0002 talk about "syscache_memory_target" and >> "syscache_prune_min_age", but those options were renamed to just >> "cache_memory_target" and "cache_prune_min_age". > > I'm at a loss how to call syscache for users. I think it is "catalog > cache". The most basic component is called catcache, which is > covered by the syscache layer, both of them are not revealed to > users, and it is shown to the user as "catalog cache". > > "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if > exists) "catalog_cache_entry_limit" and > "catalog_cache_prune_ratio" make sense? > I think "catalog_cache" sounds about right, although my point was simply that there's a discrepancy between sgml docs and code. >> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's >> defined three times in guc.c for some reason. > > It is just a PoC, added to show how it looks. (The multiple > instances must be a result of a convulsion of my fingers..) I > think this is not useful unless it can be specified on a per-relation > or per-cache basis. I'll remove the GUC and add reloptions for > the purpose. (But it won't work for pg_class and pg_attribute > for now). > OK, although I'd just keep it as simple as possible. TBH I can't really imagine users tuning limits for individual caches in any meaningful way. 
>> 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead >> of just using two bool variables prune_by_age and prune_by_number doing >> the same thing. > > Agreed. That was a kind of memory-stinginess, which is useless there. > >> 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that >> pretty much mean long-running statements will set the lastaccess to very >> old timestamp? Also, it means that long-running statements (like a PL >> function accessing a bunch of tables) won't do any eviction at all, no? >> AFAICS we'll set the timestamp only once, at the very beginning. >> >> I wonder whether using some other timestamp source (like a timestamp >> updated regularly from a timer, or something like that). > > I didn't consider planning that happens within a function. If > 5min is the default for catalog_cache_prune_min_age, 10% of it > (30s) seems enough and gettimeofday() with such intervals wouldn't > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather > than a fixed value of 30s, with 1s as the minimum. > Actually, I see CatCacheCleanupOldEntries contains this comment: /* * Calculate the duration from the time of the last access to the * "current" time. Since catcacheclock is not advanced within a * transaction, the entries that are accessed within the current * transaction won't be pruned. */ which I think is pretty much what I've been saying ... But the question is whether we need to do something about it. > I observed significant degradation from setting up a timer at every > statement start. The patch does the following to get rid of > the degradation. > > (1) Every statement updates the catcache timestamp as it currently > does. (SetCatCacheClock) > > (2) The timestamp is also updated periodically using a timer, > separately from (1). The timer is started, if not yet running, at the time > of (1). (SetCatCacheClock, UpdateCatCacheClock) > > (3) Statement end and transaction end don't stop the timer, to > avoid the overhead of setting up a timer. 
> > (4) But it stops on error. I chose not to change the behavior in > PostgresMain that kills all timers on error. > > (5) Also changing the GUC catalog_cache_prune_min_age kills the > timer, in order to reflect the change quickly, especially when > it is shortened. > Interesting. What was the frequency of the timer / how often was it executed? Can you share the code somehow? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
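The five-point clock-maintenance protocol quoted above can be sketched as a small state model. This is an illustrative simplification, not the patch's code: the real implementation uses TimestampTz and the timeout.c machinery (enable_timeout_after and friends), while here plain milliseconds stand in for both the clock and the timer.

```c
#include <assert.h>
#include <stdbool.h>

#define PRUNE_MIN_AGE_MS 300000		/* catalog_cache_prune_min_age = 5 min */

typedef struct ClockState
{
	long		clock;			/* stands in for catcacheclock */
	bool		timer_active;	/* catcache_clock_timeout_active */
	long		next_fire;		/* when the periodic timer will fire */
} ClockState;

/* One tenth of the GUC, clamped to a 1 second minimum (step (2)'s interval) */
static long
timer_delay(void)
{
	long		delay = PRUNE_MIN_AGE_MS / 10;

	return delay < 1000 ? 1000 : delay;
}

/*
 * Step (1): every statement start updates the clock; the timer is armed
 * only if it is not already running.
 */
static void
statement_start(ClockState *cs, long now)
{
	cs->clock = now;
	if (!cs->timer_active)
	{
		cs->timer_active = true;
		cs->next_fire = now + timer_delay();
	}
}

/*
 * Step (2): the timer callback refreshes the clock and re-arms itself;
 * per step (3), statement/transaction end does not stop it.
 */
static void
timer_fires(ClockState *cs, long now)
{
	cs->clock = now;
	cs->next_fire = now + timer_delay();
}

/* Steps (4)/(5): error recovery or a GUC change kills the timer. */
static void
kill_timer(ClockState *cs)
{
	cs->timer_active = false;
}
```

The point of the scheme is visible in the model: consecutive statement starts only touch the clock, so the cost of arming a timer is paid once per idle-to-busy transition rather than once per statement.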
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > I'm at a loss how to call syscache for users. I think it is "catalog > cache". The most basic component is called catcache, which is > covered by the syscache layer, both of them are not revealed to > users, and it is shown to the user as "catalog cache". > > "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if > exists) "catalog_cache_entry_limit" and > "catalog_cache_prune_ratio" make sense? The PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more familiar? I'm for either catalog_ or syscat_, but what name shall we use for the relation cache? catcache and relcache have different element sizes and possibly different usage patterns, so they may as well have different parameters, just like MySQL does. If we follow that idea, then the name would be relation_cache_xxx. However, from the user's viewpoint, the relation cache is also created from the system catalog like pg_class and pg_attribute... Regards Takayuki Tsunakawa
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com] > > I didn't consider planning that happens within a function. If > > 5min is the default for catalog_cache_prune_min_age, 10% of it > > (30s) seems enough and gettimeofday() with such intervals wouldn't > > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather > > than a fixed value of 30s, with 1s as the minimum. > > > > Actually, I see CatCacheCleanupOldEntries contains this comment: > > /* > * Calculate the duration from the time of the last access to the > * "current" time. Since catcacheclock is not advanced within a > * transaction, the entries that are accessed within the current > * transaction won't be pruned. > */ > > which I think is pretty much what I've been saying ... But the question > is whether we need to do something about it. Hmm, I'm surprised at the v14 patch on this point. I remember that previous patches renewed the cache clock on every statement, and that is correct. If the cache clock is only updated at the beginning of a transaction, the following TODO item would not be solved: https://wiki.postgresql.org/wiki/Todo " Reduce memory use when analyzing many tables in a single command by making catcache and syscache flushable or bounded." Also, Tom mentioned pg_dump in this thread (protect syscache...). pg_dump runs in a single transaction, touching all system catalogs. That may result in OOM, and this patch can prevent it. Regards Takayuki Tsunakawa
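The concern above comes down to a one-line age test. The sketch below is a simplification (the real code compares TimestampTz values via TimestampDifference): if catcacheclock is frozen for a whole transaction, no entry touched at its start ever crosses the pruning threshold, whereas a clock renewed per statement (or from the timer) lets idle entries age out even inside a long pg_dump-style transaction.

```c
#include <assert.h>
#include <stdbool.h>

#define PRUNE_MIN_AGE 300		/* catalog_cache_prune_min_age, in seconds */

/*
 * An entry is a pruning candidate once it has sat unaccessed longer than
 * the threshold, measured against the catcache clock; plain seconds stand
 * in for TimestampTz here.
 */
static bool
prunable(long lastaccess, long catcacheclock)
{
	return (catcacheclock - lastaccess) > PRUNE_MIN_AGE;
}
```

With a per-transaction clock, lastaccess and catcacheclock stay equal for everything touched in the transaction, so the predicate is never true no matter how much wall-clock time passes.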
At Tue, 12 Feb 2019 20:36:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp> > > (4) > > + hash_size = cp->cc_nbuckets * sizeof(dlist_head); > > + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; > > + tupsize = sizeof(CatCTup); > > > > GetMemoryChunkSpace() should be used to include the memory context overhead. That's what the files in src/backend/utils/sort/do. > > Thanks. Done. Include bucket and cache header part but still > excluding clist. Renamed from tupsize to memusage. It was too complex, as I was afraid. The indirect calls cause significant degradation. (Anyway the previous code was bogus in that it passed a CACHELINEALIGN'ed pointer to get_chunk_size..) Instead, I added an accounting(?) interface function. | MemoryContextGetConsumption(MemoryContext cxt); The API returns the current consumption in this memory context. This allows "real" memory accounting almost without overhead. (1) New patch v15-0002 adds an accounting feature to MemoryContext. (It adds this feature only to AllocSet; if this is acceptable, it can be extended to other allocators.) (2) Another new patch v15-0005 on top of the previous design of the limit-by-number-of-a-cache feature converts it to a limit-by-size-on-all-caches feature, which I think is what Tsunakawa-san wanted. As far as I can see no significant degradation is found in usual (as long as pruning doesn't happen) code paths. About the new global-size based eviction (2), cache entry creation becomes slow after the total size reaches the limit, since every new entry evicts one or more old (= not-recently-used) entries. Because it doesn't need knobs for each cache, it becomes far more realistic. So I added documentation of "catalog_cache_max_size" in 0005. About the age-based eviction, the bulk eviction seems to take a somewhat long time, but it happens instead of hash resizing, so the user doesn't observe additional slowdown. 
On the contrary, the pruning can avoid a rehash that scans the whole cache. I think that is the gain seen in Tomas' experiment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/5] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it. --- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list. + */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From ade1f6bf389d834cd4428f302a5cc4deaf66be9e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 13 Feb 2019 13:36:38 +0900 Subject: [PATCH 2/5] Memory consumption report feature of MemoryContext This adds a feature that counts memory consumption (in other words, the internally allocated size for a chunk) and lets it be read by others. This allows other features to know the "(almost) real" consumption of memory. 
--- src/backend/utils/mmgr/aset.c | 13 +++++++++++++ src/backend/utils/mmgr/mcxt.c | 1 + src/include/nodes/memnodes.h | 4 ++++ 3 files changed, 18 insertions(+) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index 08aff333a4..3c5798734c 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context) /* Reset block size allocation sequence, too */ set->nextBlockSize = set->initBlockSize; + + /* Reset consumption account */ + set->header.consumption = 0; } /* @@ -778,6 +781,8 @@ AllocSetAlloc(MemoryContext context, Size size) /* Disallow external access to private part of chunk header. */ VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN); + context->consumption += chunk_size; + return AllocChunkGetPointer(chunk); } @@ -817,6 +822,8 @@ AllocSetAlloc(MemoryContext context, Size size) /* Disallow external access to private part of chunk header. */ VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN); + context->consumption += chunk->size; + return AllocChunkGetPointer(chunk); } @@ -976,6 +983,8 @@ AllocSetAlloc(MemoryContext context, Size size) /* Disallow external access to private part of chunk header. 
*/ VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN); + context->consumption += chunk_size; + return AllocChunkGetPointer(chunk); } @@ -1022,6 +1031,7 @@ AllocSetFree(MemoryContext context, void *pointer) elog(ERROR, "could not find block containing chunk %p", chunk); /* OK, remove block from aset's list and free it */ + context->consumption -= chunk->size; if (block->prev) block->prev->next = block->next; else @@ -1039,6 +1049,7 @@ AllocSetFree(MemoryContext context, void *pointer) int fidx = AllocSetFreeIndex(chunk->size); chunk->aset = (void *) set->freelist[fidx]; + context->consumption -= chunk->size; #ifdef CLOBBER_FREED_MEMORY wipe_mem(pointer, chunk->size); @@ -1159,6 +1170,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size) /* Do the realloc */ chksize = MAXALIGN(size); blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; + context->consumption -= oldsize; block = (AllocBlock) realloc(block, blksize); if (block == NULL) { @@ -1178,6 +1190,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size) if (block->next) block->next->prev = block; chunk->size = chksize; + context->consumption += chksize; #ifdef MEMORY_CONTEXT_CHECKING #ifdef RANDOMIZE_ALLOCATED_MEMORY diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index 43c58c351b..395fca9e5d 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -740,6 +740,7 @@ MemoryContextCreate(MemoryContext node, node->name = name; node->ident = NULL; node->reset_cbs = NULL; + node->consumption = 0; /* OK to link node into context tree */ if (parent) diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h index dbae98d3d9..cb0f23bac7 100644 --- a/src/include/nodes/memnodes.h +++ b/src/include/nodes/memnodes.h @@ -87,6 +87,7 @@ typedef struct MemoryContextData const char *name; /* context name (just for debugging) */ const char *ident; /* context ID if any (just for debugging) */ MemoryContextCallback *reset_cbs; 
/* list of reset/delete callbacks */ + uint64 consumption; /* accumulates consumed memory size */ } MemoryContextData; /* utils/palloc.h contains typedef struct MemoryContextData *MemoryContext */ @@ -105,3 +106,6 @@ typedef struct MemoryContextData IsA((context), GenerationContext))) #endif /* MEMNODES_H */ + +/* Interface routines for memory consumption-based accounting */ +#define MemoryContextGetConsumption(c) ((c)->consumption) -- 2.16.3 From 92c2a6f0c0696d1cef617115a199d09ae1fc0e76 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 3/5] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, This adds consideration of removal of entries that haven't been used for a certain time before enlarging the hash array. This also can put a hard limit on the number of catcache entries. --- doc/src/sgml/config.sgml | 40 ++++ src/backend/tcop/postgres.c | 13 ++ src/backend/utils/cache/catcache.c | 283 +++++++++++++++++++++++++- src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 + src/backend/utils/misc/guc.c | 43 ++++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/miscadmin.h | 1 + src/include/utils/catcache.h | 50 ++++- src/include/utils/timeout.h | 1 + 10 files changed, 436 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 07b847a8e9..4749ad61a9 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1661,6 +1661,46 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + 
</term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed. -1 indicates that this feature + is disabled entirely. The value defaults to 300 seconds (<literal>5 + minutes</literal>). Catalog cache entries that have not been used for + this duration can be removed to prevent the cache from being filled up + with useless entries. This behaviour is muted until the size of a catalog + cache exceeds <xref linkend="guc-catalog-cache-memory-target"/>. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target"> + <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_memory_target</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory to which a system catalog cache + can expand without pruning, in kilobytes. The value defaults to 0, + indicating that age-based pruning is always considered. After + exceeding this size, the catalog cache starts pruning according to + <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep + a certain amount of catalog cache entries with intermittent usage, try + increasing this setting. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 36cfd507b2..f192ee2ca6 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2584,6 +2585,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void @@ -3159,6 +3161,14 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + if (CatcacheClockTimeoutPending) + { + CatcacheClockTimeoutPending = 0; + + /* Update timestamp, then set up the next timeout */ + UpdateCatCacheClock(); + } } @@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[], QueryCancelPending = false; /* second to avoid race condition */ stmt_timeout_active = false; + /* get in sync with the timer state */ + catcache_clock_timeout_active = false; + /* Not reading from the client anymore. */ DoingCommandRead = false; diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 258a1d64cc..04a60a490a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -39,6 +39,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -71,9 +72,43 @@ #define CACHE6_elog(a,b,c,d,e,f,g) #endif +/* GUC variable to define the minimum age, in seconds, of entries that will be + * considered for eviction. This variable is shared among various cache + * mechanisms. 
+ */ +int catalog_cache_prune_min_age = 300; + +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int catalog_cache_memory_target = 0; + +/* + * GUC for limit by the number of entries. Entries are removed when their + * number goes above catalog_cache_entry_limit, leaving newer entries + * according to the ratio specified by catalog_cache_prune_ratio. + */ +int catalog_cache_entry_limit = 0; +double catalog_cache_prune_ratio = 0.8; + +/* + * Flag to keep track of whether catcache clock timer is active. + */ +bool catcache_clock_timeout_active = false; + +/* + * Minimum interval between two successive moves of a cache entry in the LRU + * list, in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock used to record the last accessed time of a catcache record. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -481,6 +516,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -490,6 +526,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_memusage -= ct->size; pfree(ct); --cache->cc_ntup; @@ -779,6 +816,7 @@ InitCatCache(int id, MemoryContext oldcxt; size_t sz; int i; + uint64 base_size; /* * nbuckets is the initial number of hash buckets to use in this catcache. 
@@ -821,8 +859,12 @@ InitCatCache(int id, * * Note: we rely on zeroing to initialize all the dlist headers correctly */ + base_size = MemoryContextGetConsumption(CacheMemoryContext); sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; cp = (CatCache *) CACHELINEALIGN(palloc0(sz)); + cp->cc_head_alloc_size = + MemoryContextGetConsumption(CacheMemoryContext) - base_size; + cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head)); /* @@ -842,6 +884,11 @@ InitCatCache(int id, for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + /* cc_head_alloc_size plus the size consumed for cc_bucket */ + cp->cc_memusage = + MemoryContextGetConsumption(CacheMemoryContext) - base_size; + + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -858,9 +905,185 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if not done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * Helper routine for SetCatCacheClock and UpdateCatCacheClock. + * + * We need to keep the catcache clock moving during a long-running query. + */ +void +SetupCatCacheClockTimer(void) +{ + long delay; + + /* stop the timer if not needed */ + if (catalog_cache_prune_min_age == 0) + { + catcache_clock_timeout_active = false; + return; + } + + /* one tenth of catalog_cache_prune_min_age, in milliseconds */ + delay = catalog_cache_prune_min_age * 1000 / 10; + + /* the lower limit is 1 second */ + if (delay < 1000) + delay = 1000; + + enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay); + + catcache_clock_timeout_active = true; +} + +/* + * Update catcacheclock: this is intended to be called from the + * CATCACHE_CLOCK_TIMEOUT handler. The interval is at least 1 second (see + * above), so the cost of GetCurrentTimestamp() is negligible here.
*/ +void +UpdateCatCacheClock(void) +{ + catcacheclock = GetCurrentTimestamp(); + SetupCatCacheClockTimer(); +} + +/* + * When catalog_cache_prune_min_age is made shorter, it may take an + * unexpectedly long time before the next clock update. Disabling the current + * timer lets the next update happen at the expected interval. This is not + * strictly necessary when the age is increased, but there is no harm in + * disabling the timer in that case either. + */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (catcache_clock_timeout_active) + disable_timeout(CATCACHE_CLOCK_TIMEOUT, false); + + catcache_clock_timeout_active = false; +} + +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can linger for a long time. We remove those that have not + * been accessed for a certain time to keep the catcache from bloating. The + * eviction uses an algorithm similar to buffer eviction, based on an access + * counter: entries that are accessed several times can live longer than + * entries with fewer accesses over the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int nelems_before = cp->cc_ntup; + int ndelelems = 0; + bool prune_by_age = false; + bool prune_by_number = false; + dlist_mutable_iter iter; + + /* prune by age only if the cache memory usage is above the target */ + if (catalog_cache_prune_min_age >= 0 && + cp->cc_memusage > (Size) catalog_cache_memory_target * 1024L) + prune_by_age = true; + + if (catalog_cache_entry_limit > 0 && + nelems_before >= catalog_cache_entry_limit) + { + ndelelems = nelems_before - + (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio); + + /* an arbitrary lower limit..
*/ + if (ndelelems < 256) + ndelelems = 256; + if (ndelelems > nelems_before) + ndelelems = nelems_before; + + prune_by_number = true; + } + + /* Return immediately if no pruning is wanted */ + if (!prune_by_age && !prune_by_number) + return false; + + /* Scan over the LRU list to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + bool remove_this = false; + + /* We don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* check against age */ + if (prune_by_age) + { + long entry_age; + int us; + + /* + * Calculate the duration from the time of the last access to the + * "current" time. Since catcacheclock is not advanced within a + * transaction, the entries that are accessed within the current + * transaction won't be pruned. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* all remaining entries are newer than this one; exit */ + prune_by_age = false; + break; + } + /* + * Entries that have not been accessed since the last pruning are + * removed on this pass, while entries that have been accessed + * several times are left alone for up to three times that + * duration before removal. We don't try to shrink the buckets, + * since pruning effectively caps catcache expansion in the long + * term.
+ */ + if (ct->naccess > 0) + ct->naccess--; + else + remove_this = true; + } + + /* check against the entry count */ + if (prune_by_number) + { + if (nremoved < ndelelems) + remove_this = true; + else + prune_by_number = false; /* we're satisfied */ + } + + /* exit immediately if both checks are finished */ + if (!prune_by_age && !prune_by_number) + break; + + /* do the work */ + if (remove_this) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, nelems_before); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -870,6 +1093,7 @@ RehashCatCache(CatCache *cp) dlist_head *newbucket; int newnbuckets; int i; + uint64 base_size = MemoryContextGetConsumption(CacheMemoryContext); elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); @@ -878,6 +1102,10 @@ RehashCatCache(CatCache *cp) newnbuckets = cp->cc_nbuckets * 2; newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); + /* recalculate memory usage from scratch */ + cp->cc_memusage = cp->cc_head_alloc_size + + MemoryContextGetConsumption(CacheMemoryContext) - base_size; + /* Move all entries from old hash table to new. */ for (i = 0; i < cp->cc_nbuckets; i++) { @@ -890,6 +1118,7 @@ RehashCatCache(CatCache *cp) dlist_delete(iter.cur); dlist_push_head(&newbucket[hashIndex], &ct->cache_elem); + cp->cc_memusage += ct->size; } } @@ -1274,6 +1503,21 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * We don't want to update the LRU too frequently. + * catalog_cache_prune_min_age can be changed within a session, so we + * need to maintain the LRU regardless of its current value.
+ */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1709,6 +1953,11 @@ SearchCatCacheList(CatCache *cache, /* Now we can build the CatCList entry. */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); nmembers = list_length(ctlist); + + /* + * Don't bother counting the list toward catcache memory usage, since + * it is short-lived. + */ cl = (CatCList *) palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *)); @@ -1819,11 +2068,13 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + uint64 base_size = MemoryContextGetConsumption(CacheMemoryContext); /* negative entries have no tuple associated */ if (ntp) { int i; + int tupsize; Assert(!negative); @@ -1842,8 +2093,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; @@ -1877,7 +2128,6 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); ct = (CatCTup *) palloc(sizeof(CatCTup)); - /* * Store keys - they'll point into separately allocated memory if not * by-value.
@@ -1898,18 +2148,36 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + ct->size = MemoryContextGetConsumption(CacheMemoryContext) - base_size; + cache->cc_memusage += ct->size; + + /* increase the refcount so that this entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, first try removing + * infrequently-used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + /* we may still want to prune by entry number, so check that too */ + else if (catalog_cache_entry_limit > 0 && + cache->cc_ntup > catalog_cache_entry_limit) + CatCacheCleanupOldEntries(cache); + + ct->refcount--; return ct; } @@ -1940,7 +2208,7 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys) /* * Helper routine that copies the keys in the srckeys array into the dstkeys * one, guaranteeing that the datums are fully allocated in the current memory - * context. + * context.
*/ static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, @@ -1976,7 +2244,6 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, att->attbyval, att->attlen); } - } /* diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..0e8b972a29 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t CatcacheClockTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a5ee209f91..9eb50e9676 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void CatcacheClockTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, + CatcacheClockTimeoutHandler); } /* @@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +CatcacheClockTimeoutHandler(void) +{ + CatcacheClockTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * 
Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 41d477165c..c62d5ad8b8 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2205,6 +2206,38 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum time catalog cache entries must remain unused before they can be removed."), + gettext_noop("Catalog cache entries that remain unused for longer than this many seconds become candidates for removal."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + + { + {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Time-based cache pruning starts working after exceeding this size."), + GUC_UNIT_KB + }, + &catalog_cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + + { + {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum number of catalog cache entries."), + NULL + }, + &catalog_cache_entry_limit, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if @@ -3368,6 +3401,16 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the ratio of entries to retain when pruning due to catalog_cache_entry_limit."), + NULL + }, + &catalog_cache_prune_ratio, + 0.8, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index ad6c436f93..aeb5968e75 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_memory_target = 0kB # in kB +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..33b800e80f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..0a714bf514 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include
"datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,11 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; + int cc_head_alloc_size; /* memory consumed to allocate this struct */ + int cc_memusage; /* memory usage of this catcache (excluding + * header part) */ + int cc_nfreeent; /* # of entries currently not referenced */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +125,10 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of accesses to this entry, capped at 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +198,45 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; +extern int catalog_cache_memory_target; +extern int catalog_cache_entry_limit; +extern double catalog_cache_prune_ratio; + +/* to use as the access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * Flag to keep track of whether the catcache timestamp timer is active. + */ +extern bool catcache_clock_timeout_active; + +/* catcache prune-time helper functions */ +extern void SetupCatCacheClockTimer(void); +extern void UpdateCatCacheClock(void); + +/* + * SetCatCacheClock - set the timestamp for catcache access records and start + * the maintenance timer if needed.
We keep updating the clock even while + * pruning is disabled so that we are not confused by a bogus clock value. + */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; + + if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0) + SetupCatCacheClockTimer(); +} + +static inline TimestampTz +GetCatCacheClock(void) +{ + return catcacheclock; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..b2d97b4f7b 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + CATCACHE_CLOCK_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3 From e4269e14958596676c2c1f0303ca171a88ae83f7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 12 Feb 2019 20:31:16 +0900 Subject: [PATCH 4/5] Syscache usage tracking feature Collects syscache usage statistics and shows them in the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval.
--- doc/src/sgml/config.sgml | 16 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 134 +++++++++++++++++ src/backend/utils/cache/catcache.c | 93 +++++++++--- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 6 +- src/include/utils/catcache.h | 9 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 564 insertions(+), 36 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 4749ad61a9..bc2bef0878 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6702,6 +6702,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval"> + <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_catalog_cache_usage_interval</varname> + configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which catalog cache usage + statistics are collected for the session. This parameter is 0 by + default, which means disabled. Only superusers can change this setting.
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..8c4ab0aef9 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; 
/* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols specifying which types of stats files to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is a bitmap of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after. 
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stats file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes the syscache stats file of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* we don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics file. + * + * If 'force' is false and the interval has not elapsed, this function skips + * writing the file and returns the time remaining in the current interval in + * milliseconds. If 'force' is true, it writes the file regardless of the + * remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return immediately if the feature is disabled */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled; remove the statistics file, if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; report the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold interrupts + * to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f192ee2ca6..d0afee189f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3159,6 +3159,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); @@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_syscache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_syscache_update_timeout = true; + enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_syscache_update_timeout) + { + disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false); + disable_idle_syscache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index b6ba856ebe..a314f431c6 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1899,3 +1902,134 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; 
+ + if (fread(&cacheid, sizeof(int), 1, fpin) != 1 || + fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 || + fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 04a60a490a..fa0d19a9c3 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -109,6 +109,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock used to record the last accessed time of a catcache record. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -640,9 +644,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -718,9 +720,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -1030,10 +1030,10 @@ CatCacheCleanupOldEntries(CatCache *cp) int us; /* - * Calculate the duration from the time of the last access to the - * "current" time. Since catcacheclock is not advanced within a - * transaction, the entries that are accessed within the current - * transaction won't be pruned. + * Calculate the duration from the time from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction always get 0 as the result. 
*/ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); @@ -1459,9 +1459,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1531,9 +1529,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1542,9 +1538,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1672,9 +1666,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE3_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1785,9 +1777,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1844,9 +1834,7 @@ SearchCatCacheList(CatCache *cache, CACHE2_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -2367,3 +2355,68 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). 
+ */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_memusage; + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* + * catalog_cache_prune_min_age can be changed on-session, fill it every + * time + */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = + (int) (catalog_cache_prune_min_age * ageclass[i]); + + /* + * nth element in nclass_entries stores the number of cache entries that + * have lived unaccessed for corresponding multiple in ageclass of + * catalog_cache_prune_min_age. + */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. Since catcacheclock is not advanced within + * a transaction, the entries that are accessed within the current + * transaction won't be pruned. 
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. + */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 0e8b972a29..b7c647b5e0 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; volatile sig_atomic_t CatcacheClockTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 9eb50e9676..2f3251e8d5 100644 --- a/src/backend/utils/init/postinit.c +++ 
b/src/backend/utils/init/postinit.c @@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); static void CatcacheClockTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, IdleInTransactionSessionTimeoutHandler); RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, CatcacheClockTimeoutHandler); + RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1249,6 +1252,14 @@ CatcacheClockTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c62d5ad8b8..7f1670fa5b 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3178,6 +3178,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. 
Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index aeb5968e75..797f52fa2a 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -556,6 +556,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_catalog_cache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 24f99f7fc4..fc35b6be47 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9689,6 +9689,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 33b800e80f..767c94a63c 100644 ---
a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..b6bfd7d644 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,7 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); - +extern void pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - * @@ -1363,5 +1365,5 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); - +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 0a714bf514..95cd885c16 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -69,10 +69,8 @@ typedef struct catcache int cc_nfreeent; /* # of entries 
currently not referenced */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -85,7 +83,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -276,4 +273,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples resides in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void 
SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index b2d97b4f7b..0677978923 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -32,6 +32,7 @@ typedef enum TimeoutId STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, CATCACHE_CLOCK_TIMEOUT, + IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 2c8e21baa7..7bd77e9972 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1921,6 +1921,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); 
pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2352,7 +2374,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3 From 05a75bff3a48007f393bf5f99e354ec0619d00c9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Wed, 13 Feb 2019 14:34:46 +0900 Subject: [PATCH 5/5] Global LRU based cache pruning. This adds a feature that removes the least recently used cache entries among all catcaches when the total memory amount goes above catalog_cache_max_size.
--- doc/src/sgml/config.sgml | 20 +++++++ src/backend/utils/cache/catcache.c | 106 +++++++++++++++++++++++-------------- src/backend/utils/misc/guc.c | 21 +++----- src/include/utils/catcache.h | 5 +- 4 files changed, 94 insertions(+), 58 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc2bef0878..daa6085693 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1701,6 +1701,26 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-max-size" xreflabel="catalog_cache_max_size"> + <term><varname>catalog_cache_max_size</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_max_size</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum total amount of memory allowed for all system + catalog caches, in kilobytes. The value defaults to 0, meaning that + pruning by this parameter is disabled. Once the amount of + memory used by all catalog caches exceeds this size, creation of a + new cache entry will remove one or more not-recently-used cache + entries. This means that frequent creation of new cache entries may + lead to a slight slowdown of queries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index fa0d19a9c3..3336ff6dc3 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -86,11 +86,9 @@ int catalog_cache_memory_target = 0; /* * GUC for limit by the number of entries. Entries are removed when the number - * of them goes above catalog_cache_entry_limit and leaving newer entries by - * the ratio specified by catalog_cache_prune_ratio.
+ * of them goes above catalog_cache_max_size in kilobytes */ -int catalog_cache_entry_limit = 0; -double catalog_cache_prune_ratio = 0.8; +int catalog_cache_max_size = 0; /* * Flag to keep track of whether catcache clock timer is active. @@ -108,6 +106,8 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock used to record the last accessed time of a catcache record. */ TimestampTz catcacheclock = 0; +dlist_head cc_lru_list = {0}; +Size global_size = 0; /* age classes for pruning */ static double ageclass[SYSCACHE_STATS_NAGECLASSES] @@ -531,6 +531,8 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) cache->cc_keyno, ct->keys); cache->cc_memusage -= ct->size; + global_size -= ct->size; + pfree(ct); --cache->cc_ntup; @@ -887,8 +889,12 @@ InitCatCache(int id, /* cc_head_alloc_size + consumed size for cc_bucket */ cp->cc_memusage = MemoryContextGetConsumption(CacheMemoryContext) - base_size; + global_size += cp->cc_memusage; + + /* initialize global LRU if not yet */ + if (cc_lru_list.head.next == NULL) + dlist_init(&cc_lru_list); - dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -981,39 +987,27 @@ assign_catalog_cache_prune_min_age(int newval, void *extra) static bool CatCacheCleanupOldEntries(CatCache *cp) { + static TimestampTz prev_warn_emit = 0; int nremoved = 0; int nelems_before = cp->cc_ntup; - int ndelelems = 0; bool prune_by_age = false; - bool prune_by_number = false; + bool prune_by_size = false; dlist_mutable_iter iter; - /* prune only if the size of the hash is above the target */ if (catalog_cache_prune_min_age >= 0 && cp->cc_memusage > (Size) catalog_cache_memory_target * 1024L) prune_by_age = true; - if (catalog_cache_entry_limit > 0 && - nelems_before >= catalog_cache_entry_limit) - { - ndelelems = nelems_before - - (int) (catalog_cache_entry_limit * catalog_cache_prune_ratio); - - /* an arbitrary lower limit.. 
*/ - if (ndelelems < 256) - ndelelems = 256; - if (ndelelems > nelems_before) - ndelelems = nelems_before; - - prune_by_number = true; - } + if (catalog_cache_max_size > 0 && + global_size >= (Size) catalog_cache_max_size * 1024) + prune_by_size = true; /* Return immediately if no pruning is wanted */ - if (!prune_by_age && !prune_by_number) + if (!prune_by_age && !prune_by_size) return false; /* Scan over LRU to find entries to remove */ - dlist_foreach_modify(iter, &cp->cc_lru_list) + dlist_foreach_modify(iter, &cc_lru_list) { CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); bool remove_this = false; @@ -1023,8 +1017,8 @@ CatCacheCleanupOldEntries(CatCache *cp) (ct->c_list && ct->c_list->refcount != 0)) continue; - /* check against age */ - if (prune_by_age) + /* check against age. prune within this cache */ + if (prune_by_age && ct->owner == cp) { long entry_age; int us; @@ -1056,31 +1050,58 @@ CatCacheCleanupOldEntries(CatCache *cp) remove_this = true; } - /* check against entry number */ - if (prune_by_number) + /* check against global size. removes from all cache */ + if (prune_by_size && !remove_this) { - if (nremoved < ndelelems) + if (global_size >= (Size) catalog_cache_max_size * 1024) remove_this = true; else - prune_by_number = false; /* we're satisfied */ + prune_by_size = false; /* we're satisfied */ } + if (!remove_this) + continue; + /* exit immediately if all finished */ - if (!prune_by_age && !prune_by_number) + if (!prune_by_age && !prune_by_size) break; /* do the work */ - if (remove_this) - { - CatCacheRemoveCTup(cp, ct); - nremoved++; - } + CatCacheRemoveCTup(ct->owner, ct); + nremoved++; } if (nremoved > 0) elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", cp->id, cp->cc_relname, nremoved, nelems_before); + /* + * Warn of too small setting of catalog_cache_max_size. Take 5 seconds + * between messages, using statement start timestamp to avoid frequent + * gettimeofday(). 
+ */ + if (prune_by_size && + (prev_warn_emit == 0 || + GetCurrentStatementStartTimestamp() - prev_warn_emit > 5000000)) + { + ErrorContextCallback *oldcb; + + /* cancel error context callbacks */ + oldcb = error_context_stack; + error_context_stack = NULL; + + ereport(LOG, ( + errmsg ("cannot reduce cache size to %d kilobytes, reduced to %d kilobytes", + catalog_cache_max_size, (int)(global_size / 1024)), + errdetail ("Consider increasing the configuration parameter \"catalog_cache_max_size\"."), + errhidecontext(true), + errhidestmt(true))); + + error_context_stack = oldcb; + + prev_warn_emit = GetCurrentStatementStartTimestamp(); + } + return nremoved > 0; } @@ -1103,6 +1124,7 @@ RehashCatCache(CatCache *cp) newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); /* recalculate memory usage from the first */ + global_size -= cp->cc_memusage; cp->cc_memusage = cp->cc_head_alloc_size + MemoryContextGetConsumption(CacheMemoryContext) - base_size; @@ -1122,6 +1144,8 @@ RehashCatCache(CatCache *cp) } } + global_size += cp->cc_memusage; + /* Switch to the new array. 
*/ pfree(cp->cc_bucket); cp->cc_nbuckets = newnbuckets; @@ -1513,7 +1537,7 @@ SearchCatCacheInternal(CatCache *cache, if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) { ct->lastaccess = catcacheclock; - dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + dlist_move_tail(&cc_lru_list, &ct->lru_node); } /* @@ -2138,7 +2162,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->hash_value = hashValue; ct->naccess = 0; ct->lastaccess = catcacheclock; - dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); + ct->owner = cache; + dlist_push_tail(&cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -2147,6 +2172,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->size = MemoryContextGetConsumption(CacheMemoryContext) - base_size; cache->cc_memusage += ct->size; + global_size += ct->size; /* increase refcount so that this survives pruning */ ct->refcount++; @@ -2161,8 +2187,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); /* we may still want to prune by entry number, check it */ - else if (catalog_cache_entry_limit > 0 && - cache->cc_ntup > catalog_cache_entry_limit) + else if (catalog_cache_max_size > 0 && + global_size > catalog_cache_max_size * 1024) CatCacheCleanupOldEntries(cache); ct->refcount--; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 7f1670fa5b..7a52c70649 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2229,12 +2229,13 @@ static struct config_int ConfigureNamesInt[] = }, { - {"catalog_cache_entry_limit", PGC_USERSET, RESOURCES_MEM, - gettext_noop("Sets the maximum entries of catcache."), - NULL + {"catalog_cache_max_size", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the maximum size of catcache in kilobytes."), + NULL, + GUC_UNIT_KB }, - &catalog_cache_entry_limit, - 0, 0, INT_MAX, + 
&catalog_cache_max_size, + 0, 0, MAX_KILOBYTES, NULL, NULL, NULL }, @@ -3411,16 +3412,6 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, - { - {"catalog_cache_prune_ratio", PGC_USERSET, RESOURCES_MEM, - gettext_noop("Reduce ratio of pruning caused by catalog_cache_entry_limit."), - NULL - }, - &catalog_cache_prune_ratio, - 0.8, 0.0, 1.0, - NULL, NULL, NULL - }, - /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 95cd885c16..1e2d6e7bd7 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -62,7 +62,6 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ - dlist_head cc_lru_list; int cc_head_alloc_size;/* consumed memory to allocate this struct */ int cc_memusage; /* memory usage of this catcache (excluding * header part) */ @@ -125,6 +124,7 @@ typedef struct catctup int naccess; /* # of access to this entry, up to 2 */ TimestampTz lastaccess; /* approx. timestamp of the last usage */ dlist_node lru_node; /* LRU node */ + CatCache *owner; /* owner catcache */ int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -198,8 +198,7 @@ extern PGDLLIMPORT MemoryContext CacheMemoryContext; /* for guc.c, not PGDLLPMPORT'ed */ extern int catalog_cache_prune_min_age; extern int catalog_cache_memory_target; -extern int catalog_cache_entry_limit; -extern double catalog_cache_prune_ratio; +extern int catalog_cache_max_size; /* to use as access timestamp of catcache entries */ extern TimestampTz catcacheclock; -- 2.16.3
At Wed, 13 Feb 2019 02:15:42 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB97CF1@G01JPEXMBYT05> > From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com] > > > I didn't consider planning that happens within a function. If > > > 5min is the default for catalog_cache_prune_min_age, 10% of it > > > (30s) seems enough and gettimeofday() with such intervals wouldn't > > > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather > > > than fixed value 30s and 1s as the minimal. > > > > > > > Actually, I see CatCacheCleanupOldEntries contains this comment: > > > > /* > > * Calculate the duration from the time of the last access to the > > * "current" time. Since catcacheclock is not advanced within a > > * transaction, the entries that are accessed within the current > > * transaction won't be pruned. > > */ > > > > which I think is pretty much what I've been saying ... But the question > > is whether we need to do something about it. > > Hmm, I'm surprised at the v14 patch about this. I remember that previous patches renewed the cache clock on every statement, and that is correct. If the cache clock is only updated at the beginning of a transaction, the following TODO item would not be solved: > > https://wiki.postgresql.org/wiki/Todo Sorry, it's just a stale comment. In v15 it is already... ouch! Still left alone. (Actually CatCacheGetStats doesn't perform pruning.) I'll remove it in the next version. It is called in start_xact_command, which is called per statement and provided with the statement timestamp. > /* > * Calculate the duration from the time of the last access to > * the "current" time. catcacheclock is updated on a per-statement > * basis and additionally updated periodically during a long > * running query. > */ > TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us);
In v14 and v15, in addition to that, a timer firing at intervals of catalog_cache_prune_min_age/10 (30s when the parameter is 5min) updates the catcache clock using gettimeofday(), which in turn is the source of the LRU timestamps.

> Also, Tom mentioned pg_dump in this thread (protect syscache...). pg_dump runs in a single transaction, touching all system catalogs. That may result in OOM, and this patch can rescue it.

So, all the problems will be addressed in v14.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
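The interval derivation described above (one tenth of catalog_cache_prune_min_age, with 1s as the minimum suggested earlier in the thread) can be sketched as follows. This is a toy illustration, not the patch's code; the function name is invented.

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the clock-update timer interval (in
 * seconds) from catalog_cache_prune_min_age (also in seconds).
 * The thread suggests prune_min_age/10 with 1s as the minimum;
 * 30s corresponds to the default prune_min_age of 5 minutes.
 */
static int
clock_timer_interval(int prune_min_age)
{
    int     interval = prune_min_age / 10;

    if (interval < 1)
        interval = 1;           /* never fire more often than once a second */
    return interval;
}
```

With the default of 300 seconds this yields the 30-second interval mentioned above.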
At Tue, 12 Feb 2019 18:33:46 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <d3b291ff-d993-78d1-8d28-61bcf72793d6@2ndquadrant.com>
> > Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if it exists) "catalog_cache_entry_limit" and "catalog_cache_prune_ratio" make sense?
>
> I think "catalog_cache" sounds about right, although my point was simply that there's a discrepancy between sgml docs and code.

"system_catalog_cache" is too long for parameter names, so I named the parameters "catalog_cache_*" and use "system catalog cache" or "catalog cache" in the documentation.

> >> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
> >> defined three times in guc.c for some reason.
> >
> > It is just a PoC, added to show how it looks. (The multiple instances must be a result of a convulsion of my fingers..) I think this is not useful unless it can be specified on a per-relation or per-cache basis. I'll remove the GUC and add reloptions for the purpose. (But it won't work for pg_class and pg_attribute for now.)
>
> OK, although I'd just keep it as simple as possible. TBH I can't really imagine users tuning limits for individual caches in any meaningful way.

I also feel that way, but anyway (:p), in v15 it has evolved into a feature that limits the total cache size based on a global LRU list.

> > I didn't consider planning that happens within a function. If 5min is the default for catalog_cache_prune_min_age, 10% of it (30s) seems enough, and gettimeofday() at such intervals wouldn't affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather than the fixed value 30s, with 1s as the minimum.
>
> Actually, I see CatCacheCleanupOldEntries contains this comment:
>
> /*
>  * Calculate the duration from the time of the last access to the
>  * "current" time. Since catcacheclock is not advanced within a
>  * transaction, the entries that are accessed within the current
>  * transaction won't be pruned.
> */
>
> which I think is pretty much what I've been saying ... But the question
> is whether we need to do something about it.

As I wrote in the message just sent in reply to Tsunakawa-san, it is just a bogus comment. The correct one is the following; I'll replace it in the next version.

> * Calculate the duration from the time of the last access to
> * the "current" time. catcacheclock is updated on a per-statement
> * basis and additionally updated periodically during a long
> * running query.

> > I observed significant degradation by setting up a timer at every statement start. The patch does the following to get rid of the degradation.
> >
> > (1) Every statement updates the catcache timestamp as it currently does. (SetCatCacheClock)
> >
> > (2) The timestamp is also updated periodically using a timer, separately from (1). The timer starts at the time of (1) if it is not yet running. (SetCatCacheClock, UpdateCatCacheClock)
> >
> > (3) Statement end and transaction end don't stop the timer, to avoid the overhead of setting up a timer again.
> >
> > (4) But it stops on error. I chose not to change the behavior in PostgresMain that kills all timers on error.
> >
> > (5) Also, changing the GUC catalog_cache_prune_min_age kills the timer, in order to reflect the change quickly, especially when it is shortened.
>
> Interesting. What was the frequency of the timer / how often was it
> executed? Can you share the code somehow?

Please find it in v14 [1] or v15 [2], which contain the same code for the purpose.

[1] https://www.postgresql.org/message-id/20190212.203628.118792892.horiguchi.kyotaro@lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/20190213.153114.239737674.horiguchi.kyotaro%40lab.ntt.co.jp

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
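The comment above describes comparing each entry's last access against the per-statement clock. A simplified model of that decision, with a plain integer standing in for TimestampTz (the type and function names here are invented, not the patch's):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for TimestampTz: microseconds since some epoch. */
typedef long long FakeTimestampTz;

/*
 * Decide whether an entry is old enough to be pruned.  Because the clock
 * is advanced per statement (and periodically by a timer during a long
 * query), entries accessed by the current statement can never appear
 * older than min_age.
 */
static bool
entry_is_prunable(FakeTimestampTz lastaccess,
                  FakeTimestampTz catcacheclock,
                  long long min_age_usec)
{
    return (catcacheclock - lastaccess) >= min_age_usec;
}
```

An entry touched even one microsecond after (clock - min_age) survives the scan; everything older becomes a pruning candidate.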
On Tue, Feb 12, 2019 at 02:53:40AM +0100, Tomas Vondra wrote:
> Right. But the logic behind the time-based approach is that evicting such
> entries should not cause any issues exactly because they are accessed
> infrequently. It might incur some latency when we need them for the
> first time after the eviction, but IMHO that's acceptable (although I
> see Andres did not like that).
>
> FWIW we might even evict entries after some time passes since inserting
> them into the cache - that's what memcached et al do, IIRC. The logic is
> that frequently accessed entries will get immediately loaded back (thus
> keeping the cache hit ratio high). But there are reasons why the other dbs
> do that - like not having any cache invalidation (unlike us).

Agreed. If this fixes 90% of the issues people will have, and it applies to the 99.9% of users who will never tune this, it is a clear win. If we want to add something that requires tuning later, we can consider it once the non-tuning solution is done.

> That being said, having a "minimal size" threshold before starting with
> the time-based eviction may be a good idea.

Agreed. I see the minimal size as a way to keep the system tables in cache, which we know we will need for the next query.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
From: Bruce Momjian [mailto:bruce@momjian.us]
> > That being said, having a "minimal size" threshold before starting with
> > the time-based eviction may be a good idea.
>
> Agreed. I see the minimal size as a way to keep the system tables in
> cache, which we know we will need for the next query.

Isn't it the maximum size, not the minimal size? A maximum size allows us to keep the desired amount of system tables in memory as well as to control memory consumption to avoid out-of-memory errors (OS crash!). I'm wondering why people want to take a different approach to the catcache, which is unlike other PostgreSQL memory such as shared_buffers, temp_buffers, SLRU buffers, and work_mem, and unlike other DBMSs.

Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> It is too complex, as I was afraid. The indirect calls cause significant
> degradation. (Anyway, the previous code was bogus in that it passes a
> CACHELINEALIGN'ed pointer to get_chunk_size..)
>
> Instead, I added an accounting(?) interface function.
>
> | MemoryContextGetConsumption(MemoryContext cxt);
>
> The API returns the current consumption in this memory context. This allows
> "real" memory accounting almost without overhead.

That looks like a great idea! Actually, I was thinking of using MemoryContextStats() or a new lightweight variant of it to get the used amount, but I was afraid it would be too costly to call in catcache code. You are smarter, and I was just stupid.

> (2) Another new patch v15-0005 on top of the previous design of the
> limit-by-number-of-a-cache feature converts it to a
> limit-by-size-on-all-caches feature, which I think is what
> Tsunakawa-san wanted.

Thank you very, very much! I look forward to reviewing v15. I'll be away from the office tomorrow, so I'd like to review it this weekend or at the beginning of next week. I've confirmed and am sure that 0001 can be committed.

> As far as I can see, no significant degradation is found in the usual code
> paths (as long as pruning doesn't happen).
>
> About the new global-size based eviction (2), cache entry creation becomes
> slow after the total size reaches the limit, since every new entry evicts
> one or more old (= not-recently-used) entries. Because knobs for each cache
> are no longer needed, it has become far more realistic. So I added
> documentation of "catalog_cache_max_size" in 0005.

Could you show us a comparison of before and after the pruning starts, if you already have it? If you lost the data, I'm OK to see the data after the code review.

Regards
Takayuki Tsunakawa
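The idea behind MemoryContextGetConsumption, as described above, is that each context keeps a running byte counter that is bumped at allocation time and decremented at free time, so reading the usage is O(1) instead of walking blocks the way MemoryContextStats() does. A toy model of that bookkeeping (not the actual MemoryContext API; all names here are invented):

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy context tracking consumed bytes the way the proposed
 * MemoryContextGetConsumption() interface would: one counter,
 * updated on every alloc/free, read without traversing anything.
 */
typedef struct ToyContext
{
    size_t      consumption;    /* bytes currently handed out */
} ToyContext;

static void *
toy_alloc(ToyContext *cxt, size_t size)
{
    cxt->consumption += size;   /* the extra instruction Andres objects to */
    return malloc(size);
}

static void
toy_free(ToyContext *cxt, void *ptr, size_t size)
{
    cxt->consumption -= size;
    free(ptr);
}

static size_t
toy_get_consumption(ToyContext *cxt)
{
    return cxt->consumption;    /* O(1) -- no block traversal */
}
```

The cheap read is what makes per-catcache-access size checks feasible; the cost, as noted downthread, is the increment on every allocation in a hot path.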
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> (2) Another new patch v15-0005 on top of the previous design of the
> limit-by-number-of-a-cache feature converts it to a
> limit-by-size-on-all-caches feature, which I think is what
> Tsunakawa-san wanted.

Yeah, size looks better to me.

> As far as I can see, no significant degradation is found in the usual code
> paths (as long as pruning doesn't happen).
>
> About the new global-size based eviction (2), cache entry creation becomes
> slow after the total size reaches the limit, since every new entry evicts
> one or more old (= not-recently-used) entries. Because knobs for each cache
> are no longer needed, it has become far more realistic. So I added
> documentation of "catalog_cache_max_size" in 0005.

Now I'm also trying to benchmark, which will be posted in another email. Here are things I noticed:

[1] compiler warning
catcache.c:109:1: warning: missing braces around initializer [-Wmissing-braces]
dlist_head cc_lru_list = {0};
^
catcache.c:109:1: warning: (near initialization for ‘cc_lru_list.head’) [-Wmissing-braces]

[2] catalog_cache_max_size does not appear in postgresql.conf.sample

[3] the global LRU list and global size can be included in CatCacheHeader, which seems to me a good place because this structure contains global cache information regardless of the kind of CatCache

[4] when applying the patches with git am, there are several warnings about trailing whitespace in v15-0003

Regards,
Takeshi Ideriha
Hi,

On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> Instead, I added an accounting(?) interface function.
>
> | MemoryContextGetConsumption(MemoryContext cxt);
>
> The API returns the current consumption in this memory
> context. This allows "real" memory accounting almost without
> overhead.

That's definitely *NOT* almost without overhead. This adds additional instructions to one of postgres' hottest sets of codepaths.

I think you're not working incrementally enough here. I strongly suggest solving the negative cache entry problem, and then incrementally going from there after that's committed. The likelihood of this patch ever getting merged otherwise seems extremely small.

Greetings,

Andres Freund
On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
> Hi,
>
> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
> > Instead, I added an accounting(?) interface function.
> >
> > | MemoryContextGetConsumption(MemoryContext cxt);
> >
> > The API returns the current consumption in this memory
> > context. This allows "real" memory accounting almost without
> > overhead.
>
> That's definitely *NOT* almost without overhead. This adds additional
> instructions to one of postgres' hottest sets of codepaths.
>
> I think you're not working incrementally enough here. I strongly suggest
> solving the negative cache entry problem, and then incrementally going from
> there after that's committed. The likelihood of this patch ever getting
> merged otherwise seems extremely small.

Agreed --- the patch is going in the wrong direction.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
> From: Bruce Momjian [mailto:bruce@momjian.us]
> > > That being said, having a "minimal size" threshold before starting
> > > with the time-based eviction may be a good idea.
> >
> > Agreed. I see the minimal size as a way to keep the system tables
> > in cache, which we know we will need for the next query.
>
> Isn't it the maximum size, not the minimal size? A maximum size allows
> us to keep the desired amount of system tables in memory as well as to
> control memory consumption to avoid out-of-memory errors (OS crash!).
> I'm wondering why people want to take a different approach to the
> catcache, which is unlike other PostgreSQL memory e.g. shared_buffers,
> temp_buffers, SLRU buffers, work_mem, and unlike other DBMSs.

Well, that is an _excellent_ question, and one I had to think about.

I think, in general, smaller is better, as long as making something smaller doesn't remove data that is frequently accessed. Having a timer to expire only old entries seems like it accomplishes this goal.

Having a minimum size and not taking it to zero size makes sense if we know we will need certain entries like pg_class in the next query. However, if the session is idle for hours, we should probably just remove everything, so maybe the minimum doesn't make sense --- just remove everything.

As for why we don't do this with everything --- we can't do it with shared_buffers since we can't change its size while the server is running. For work_mem, we assume all the work_mem data is for the current query, and therefore frequently accessed. Also, work_mem is not memory we can just free when it is not used, since it contains intermediate results required by the current query.

I think temp_buffers, since it can be resized in the session, actually could use a similar minimizing feature, though that would mean it behaves slightly differently from shared_buffers, and it might not be worth it.
Also, I assume the value of temp_buffers was mostly for use by the current query --- yes, it can be used for cross-query caching, but I am not sure if that is its primary purpose. I thought its goal was to prevent shared_buffers from being populated with temporary per-session buffers. I don't think other DBMSs are a good model since they have a reputation for requiring a lot of tuning --- tuning that we have often automated. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> > About the new global-size based eviction (2), cache entry creation
> > becomes slow after the total size reaches the limit, since every
> > new entry evicts one or more old (= not-recently-used) entries.
> > Because knobs for each cache are no longer needed, it has become
> > far more realistic. So I added documentation of
> > "catalog_cache_max_size" in 0005.
>
> Now I'm also trying to benchmark, which will be posted in another email.

According to recent comments by Andres and Bruce, maybe we should address negative cache bloat step by step, for example by reviewing Tom's patch. But at the same time, I did some benchmarking with only the hard limit option enabled and the time-related options disabled, because figures for this case have not been provided in this thread. So let me share them.

I did two experiments. One is to show that negative cache bloat is suppressed. This thread originated from the issue that the negative cache of pg_statistic bloats as creating and dropping a temp table is repeatedly executed.
https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

Using the script attached to the first email in this thread, I repeated create and drop temp table 10000 times. (The experiment was repeated 5 times. catalog_cache_max_size = 500kB; compared the master branch and the patch with the hard memory limit.)

Here are TPS and CacheMemoryContext 'used' memory (total - freespace), calculated by MemoryContextPrintStats(), at 100, 1000, and 10000 create-and-drop transactions. The result shows that cache bloating is suppressed after exceeding the limit (at 10000), but TPS declines regardless of the limit.
number of tx (create and drop)   | 100    |1000    |10000
-----------------------------------------------------------
used CacheMemoryContext (master) |610296  |2029256 |15909024
used CacheMemoryContext (patch)  |755176  |880552  |880592
-----------------------------------------------------------
TPS (master)                     |414     |407     |399
TPS (patch)                      |242     |225     |220

Another experiment used Tomas's script posted a while ago. The scenario is to run "select 1" from multiple tables chosen randomly (uniform distribution). (The experiment was repeated 5 times. catalog_cache_max_size = 10MB; compared the master branch and the patch with only the hard memory limit enabled.)

Before doing the benchmark, I checked with a debug option that pruning happens only with 10000 tables. The result shows degradation regardless of whether it is before or after pruning. I personally still need the hard size limitation, but I'm surprised that the difference is so significant.

number of tables | 100    |1000    |10000
-----------------------------------------------------------
TPS (master)     |10966   |10654   |9099
TPS (patch)      |4491    |2099    |378

Regards,
Takeshi Ideriha
On 2/13/19 1:23 AM, Tsunakawa, Takayuki wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>> I'm at a loss what to call syscache for users. I think it is "catalog
>> cache". The most basic component is called catcache, which is
>> covered by the syscache layer; neither of them is revealed to
>> users, and it is shown to the user as "catalog cache".
>>
>> Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
>> it exists) "catalog_cache_entry_limit" and
>> "catalog_cache_prune_ratio" make sense?
>
> PostgreSQL documentation uses "system catalog" in its table of contents, so syscat_cache_xxx would be a bit more familiar? I'm for either catalog_ or syscat_, but what name shall we use for the relation cache? The catcache and relcache have different element sizes and possibly different usage patterns, so they may as well have different parameters, just like MySQL does. If we follow that idea, then the name would be relation_cache_xxx. However, from the user's viewpoint, the relation cache is also created from system catalogs like pg_class and pg_attribute...

I think "catalog_cache_..." is fine. If we end up with a similar patch for the relcache, we can probably call it "relation_cache_...". I'd be OK even with "system_catalog_cache_..." - I don't think it's overly long (better to have a longer but descriptive name), and "syscat" just seems like unnecessary abbreviation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/14/19 3:46 PM, Bruce Momjian wrote:
> On Thu, Feb 14, 2019 at 12:40:10AM -0800, Andres Freund wrote:
>> Hi,
>>
>> On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote:
>>> Instead, I added an accounting(?) interface function.
>>>
>>> | MemoryContextGetConsumption(MemoryContext cxt);
>>>
>>> The API returns the current consumption in this memory
>>> context. This allows "real" memory accounting almost without
>>> overhead.
>>
>> That's definitely *NOT* almost without overhead. This adds additional
>> instructions to one of postgres' hottest sets of codepaths.
>>
>> I think you're not working incrementally enough here. I strongly suggest
>> solving the negative cache entry problem, and then incrementally going from
>> there after that's committed. The likelihood of this patch ever getting
>> merged otherwise seems extremely small.
>
> Agreed --- the patch is going in the wrong direction.

I recall endless discussions about memory accounting in the "memory-bounded hash-aggregate" patch a couple of years ago, and the overhead was one of the main issues there. So yeah, trying to solve that problem here is likely to kill this patch (or at least significantly delay it).

ISTM there's a couple of ways to deal with that:

1) Ignore the memory amounts entirely, and do just time-based eviction.

2) If we want some size thresholds (e.g. to disable eviction for backends with small caches etc.), use the number of entries instead. I don't think that's particularly worse than specifying a size in MB.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2/14/19 4:49 PM, 'Bruce Momjian' wrote:
> On Thu, Feb 14, 2019 at 01:31:49AM +0000, Tsunakawa, Takayuki wrote:
>> From: Bruce Momjian [mailto:bruce@momjian.us]
>>>> That being said, having a "minimal size" threshold before starting
>>>> with the time-based eviction may be a good idea.
>>>
>>> Agreed. I see the minimal size as a way to keep the system tables
>>> in cache, which we know we will need for the next query.
>>
>> Isn't it the maximum size, not the minimal size? A maximum size allows
>> us to keep the desired amount of system tables in memory as well as to
>> control memory consumption to avoid out-of-memory errors (OS crash!).
>> I'm wondering why people want to take a different approach to the
>> catcache, which is unlike other PostgreSQL memory e.g. shared_buffers,
>> temp_buffers, SLRU buffers, work_mem, and unlike other DBMSs.
>
> Well, that is an _excellent_ question, and one I had to think about.

I think we're talking about two different concepts here:

1) minimal size - We don't do any extra eviction at all until we reach this cache size, so a system that does not have issues gets no extra overhead from it.

2) maximal size - We ensure the cache size stays below this threshold. If there's more data, we evict enough entries to get below it.

My proposal is essentially to do just (1), so the cache can grow very large if needed, but then it shrinks again after a while.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
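The two concepts above compose naturally: skip eviction entirely while the cache is below the minimal size, and once above it, prune by idle time. A schematic predicate for that combined policy (all names invented for illustration, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Schematic policy combining the two thresholds discussed here:
 * below min_size nothing is ever evicted, so small caches pay no
 * overhead (concept 1); above it, entries idle longer than min_age
 * become eviction candidates (time-based eviction).
 */
static bool
should_evict(size_t cache_size, size_t min_size,
             long long idle_time, long long min_age)
{
    if (cache_size <= min_size)
        return false;               /* concept (1): no eviction below threshold */
    return idle_time >= min_age;    /* time-based eviction beyond it */
}
```

A strict maximal size (concept 2) would instead loop evicting LRU entries unconditionally until the size drops below the cap, regardless of entry age.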
From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
> I think "catalog_cache_..." is fine. If we end up with a similar
> patch for the relcache, we can probably call it "relation_cache_...".

Agreed, those are neither too long nor too short, and they are sufficiently descriptive.

Regards
Takayuki Tsunakawa
On 2019-Feb-15, Tomas Vondra wrote:
> ISTM there's a couple of ways to deal with that:
>
> 1) Ignore the memory amounts entirely, and do just time-based eviction.
>
> 2) If we want some size thresholds (e.g. to disable eviction for
> backends with small caches etc.), use the number of entries instead. I
> don't think that's particularly worse than specifying a size in MB.

Why is there a *need* for size-based eviction? Seems that time-based should be sufficient. Is the proposed approach to avoid eviction at all until the size threshold has been reached? I'm not sure I see the point of that.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Horiguchi-san,

I've looked through your patches. This is the first part of my review results; let me post the rest after another day's work.

BTW, how about merging 0003 and 0005, and separating and deferring 0004 to another thread? That may help to relieve other community members by making this patch set less large and complex.

[Bottleneck investigation]
Ideriha-san and I are trying to find the bottleneck. My first try shows there's little overhead. Here's what I did:

<postgresql.conf>
shared_buffers = 1GB
catalog_cache_prune_min_age = -1
catalog_cache_max_size = 10MB

<benchmark>
$ pgbench -i -s 10
$ pg_ctl stop and then start
$ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
$ pgbench --select-only -c 1 -T 60

<result>
master : 8612 tps
patched: 8553 tps (-0.7%)

There's little (0.7%) performance overhead with:
* one additional dlist_move_tail() in every catcache access
* memory usage accounting in operations other than catcache access (the relevant catcache entries should be cached in the first pgbench transaction)

I'll check other patterns to find out how big the overhead is.

[Source code review]
Below are my findings on the patch set v15:

(1) patch 0001
All right.

(2) patch 0002
@@ -87,6 +87,7 @@ typedef struct MemoryContextData
 const char *name; /* context name (just for debugging) */
 const char *ident; /* context ID if any (just for debugging) */
 MemoryContextCallback *reset_cbs; /* list of reset/delete callbacks */
+ uint64 consumption; /* accumulates consumed memory size */
 } MemoryContextData;

Size is more appropriate as a data type than uint64 because other places use Size for memory size variables. How about "usedspace" instead of "consumption"? That aligns better with the naming used for MemoryContextCounters's member variables, totalspace and freespace.
(3) patch 0002
+ context->consumption += chunk_size;
(and similar sites)

The used space should include the size of the context-type-specific chunk header, so that the count is closer to the actual memory size seen by the user.

Here, let's reach consensus on what the used space represents. Is it either of the following?

a) The total space allocated from the OS, i.e., the sum of the malloc()ed regions for a given memory context.
b) The total space of all chunks, including their headers, of a given memory context.

a) is better because that's the actual memory usage from the DBA's standpoint. But a) cannot be used because CacheMemoryContext is used for various things. So we have to compromise on b). Is this OK? One possible future improvement is to use a separate memory context exclusively for the catcache, as a child of CacheMemoryContext. That way, we can adopt a).

(4) patch 0002
@@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context)
+ set->header.consumption = 0;

This can be put in MemoryContextResetOnly() instead of the context-type-specific reset functions.

Regards
Takayuki Tsunakawa
Hi Horiguchi-san,

This is the rest of my review comments.

(5) patch 0003
 CatcacheClockTimeoutPending = 0;
+
+ /* Update timetamp then set up the next timeout */
+

false is better than 0, to follow the other **Pending variables.
timetamp -> timestamp

(6) patch 0003
GetCatCacheClock() is not used now. Why don't we add it when the need arises?

(7) patch 0003
Why don't we remove the catcache timer (Setup/UpdateCatCacheClockTimer), unless we need it by all means? That simplifies the code.

Long-running queries can be thought of as follows:

* A single lengthy SQL statement, e.g. SELECT for reporting/analytics, COPY for data loading, or UPDATE/DELETE for batch processing, should only require a small number of catalog entries during its query analysis/planning. It won't suffer from cache eviction during query execution.

* We do not have to evict cache entries while executing a long-running stored procedure, because its constituent SQL statements may access the same tables. If the stored procedure accesses so many tables that you are worried about catcache memory overuse, then catalog_cache_max_size can be used. Another natural idea would be to update the cache clock when SPI executes each SQL statement.

(8) patch 0003
+ uint64 base_size;
+ uint64 base_size = MemoryContextGetConsumption(CacheMemoryContext);

This may as well be Size, not uint64.

(9) patch 0003
@@ -1940,7 +2208,7 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 /*
  * Helper routine that copies the keys in the srckeys array into the dstkeys
  * one, guaranteeing that the datums are fully allocated in the current memory
- * context.
+ * context. Returns allocated memory size.
  */
 static void
 CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
@@ -1976,7 +2244,6 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 att->attbyval, att->attlen);
 }
-
 }

This change seems to be no longer necessary thanks to the memory accounting.
(10) patch 0004
How about separating this into another thread, so that the rest of the patch set becomes easier to review and commit?

Regarding the design, I'm inclined to avoid having each backend write the file. To simplify the code, I think we can take advantage of the fortunate situation -- the number of backends and catcaches is fixed at server startup. My rough sketch is:

* Allocate an array of statistics entries in shared memory, whose elements are (pid or backend id, catcache id or name, hits, misses, ...). The number of array elements is MaxBackends * the number of catcaches (some dozens).
* Each backend updates its own entries in the shared memory during query execution.
* The stats collector periodically scans the array and writes it to the stats file.

(11) patch 0005
+dlist_head cc_lru_list = {0};
+Size global_size = 0;

It is better to put these in CatCacheHeader. That way, backends that do not access the catcache (archiver, stats collector, etc.) do not have to waste memory on these global variables.

(12) patch 0005
+ else if (catalog_cache_max_size > 0 &&
+ global_size > catalog_cache_max_size * 1024)
 CatCacheCleanupOldEntries(cache);

On the second line, catalog_cache_max_size should be cast to Size to avoid overflow.

(13) patch 0005
+ gettext_noop("Sets the maximum size of catcache in kilobytes."),

catcache -> catalog cache

(14) patch 0005
+ CatCache *owner; /* owner catcache */

CatCTup already has a my_cache member.

(15) patch 0005
 if (nremoved > 0)
 elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
 cp->id, cp->cc_relname, nremoved, nelems_before);

In the prune-by-size case, this elog doesn't print very meaningful data. How about dividing this function into two, one for prune-by-age and another for prune-by-size? I suppose that would make the functions easier to understand.

Regards
Takayuki Tsunakawa
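The shared-memory sketch in (10) above relies on a flat array of MaxBackends * n_catcaches fixed-size slots, where each backend touches only its own slots and so needs no locking against other backends. A minimal illustration of that layout and the index computation (all names here are hypothetical, not from the patch):

```c
#include <assert.h>

/*
 * Hypothetical flat layout for per-backend, per-catcache statistics
 * in shared memory: MaxBackends * n_catcaches fixed-size slots,
 * sized once at server startup.
 */
typedef struct CatCacheStatsEntry
{
    int         backend_id;     /* which backend owns this slot */
    int         cache_id;       /* which syscache the counters are for */
    long        hits;
    long        misses;
} CatCacheStatsEntry;

/* Index of the slot a given backend updates for a given cache. */
static int
stats_slot_index(int backend_id, int cache_id, int n_catcaches)
{
    return backend_id * n_catcaches + cache_id;
}
```

Because every backend writes a disjoint slice of the array, the stats collector can scan the whole array without coordinating with writers beyond tolerating slightly stale counters.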
From: 'Bruce Momjian' [mailto:bruce@momjian.us]
> I think, in general, smaller is better, as long as making something
> smaller doesn't remove data that is frequently accessed. Having a timer
> to expire only old entries seems like it accomplishes this goal.
>
> Having a minimum size and not taking it to zero size makes sense if we
> know we will need certain entries like pg_class in the next query.
> However, if the session is idle for hours, we should probably just
> remove everything, so maybe the minimum doesn't make sense --- just
> remove everything.

That's another interesting idea. A somewhat relevant feature is Oracle's "ALTER SYSTEM FLUSH SHARED_POOL". It flushes all dictionary cache, library cache, and SQL plan entries. The purpose is different: not to release memory, but to defragment the shared memory.

> I don't think other DBMSs are a good model since they have a reputation
> for requiring a lot of tuning --- tuning that we have often automated.

Yeah, I agree that PostgreSQL is easier to use in many aspects. On the other hand, although I hesitate to say this (please don't get upset...), I feel PostgreSQL is a bit too loose about memory usage. To my memory, PostgreSQL has crashed the OS due to OOM in our user environments:

* Creating and dropping temp tables repeatedly in a stored PL/pgSQL function. This results in unbounded CacheMemoryContext bloat. This is referred to at the beginning of this mail thread. Oracle and MySQL can limit the size of the dictionary cache.

* Each pair of SAVEPOINT/RELEASE leaves 8KB of CurTransactionContext. The customer used psqlODBC to run a batch app, which ran millions of SQL statements in a transaction. psqlODBC wraps each SQL statement with SAVEPOINT and RELEASE by default. I guess this is what caused the crash of AWS Aurora on last year's Amazon Prime Day.

* Setting a large value for work_mem, and then running many concurrent large queries. Oracle can limit the total size of all sessions' memory with the PGA_AGGREGATE_TARGET parameter.
We all have to manage things within resource constraints. The DBA wants to make sure the server doesn't overuse memory, to avoid a crash or slowdown due to swapping. Oracle does it, and another open source database, MySQL, does it too. PostgreSQL does it with shared_buffers, wal_buffers, and work_mem (within a single session). So I thought it natural to do the same with the catcache/relcache/plancache.

Regards
Takayuki Tsunakawa
On 2/19/19 12:43 AM, Tsunakawa, Takayuki wrote:
> Hi Horiguchi-san,
>
> I've looked through your patches. This is the first part of my review results; let me post the rest after another day's work.
>
> BTW, how about merging 0003 and 0005, and separating and deferring 0004 to another thread? That may help to relieve other community members by making this patch set less large and complex.
>
> [Bottleneck investigation]
> Ideriha-san and I are trying to find the bottleneck. My first try shows there's little overhead. Here's what I did:
>
> <postgresql.conf>
> shared_buffers = 1GB
> catalog_cache_prune_min_age = -1
> catalog_cache_max_size = 10MB
>
> <benchmark>
> $ pgbench -i -s 10
> $ pg_ctl stop and then start
> $ cache all data in shared buffers by running pg_prewarm on branches, tellers, accounts, and their indexes
> $ pgbench --select-only -c 1 -T 60
>
> <result>
> master : 8612 tps
> patched: 8553 tps (-0.7%)
>
> There's little (0.7%) performance overhead with:
> * one additional dlist_move_tail() in every catcache access
> * memory usage accounting in operations other than catcache access (the relevant catcache entries should be cached in the first pgbench transaction)
>
> I'll check other patterns to find out how big the overhead is.

0.7% may easily be just noise, possibly due to differences in the layout of the binary. How many runs? What was the variability of the results between runs? What hardware was this tested on?

FWIW I doubt tests with such a small schema prove anything - the cache/lists are likely tiny. That's why I tested with a much larger number of relations.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >But at the same time, I did some benchmark with only hard limit option enabled and >time-related option disabled, because the figures of this case are not provided in this >thread. >So let me share it. I'm sorry but I'm taking back the result about the patch and correcting it. I configured postgres (master) with only 'CFLAGS=-O2' but I misconfigured postgres (patch applied) with --enable-cassert --enable-debug --enable-tap-tests 'CFLAGS=-O0'. These debug options (especially --enable-cassert) caused enormous overhead. (I thought I checked the configure options.. I was maybe tired.) So I changed these to only 'CFLAGS=-O2' and re-measured them. >I did two experiments. One is to show negative cache bloat is suppressed. >This thread originated from the issue that negative cache of pg_statistics is bloating as >creating and dropping temp table is repeatedly executed. >https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp >Using the script attached to the first email in this thread, I repeated create and drop >temp table 10000 times. >(experiment is repeated 5 times. catalog_cache_max_size = 500kB. > compared master branch and patch with hard memory limit) > >Here are TPS and CacheMemoryContext 'used' memory (total - freespace) calculated >by MemoryContextPrintStats() at 100, 1000, 10000 times of create-and-drop >transaction. The result shows cache bloating is suppressed after exceeding the limit >(at 10000) but tps declines regardless of the limit.
>
>number of tx (create and drop)   | 100    |1000    |10000
>-----------------------------------------------------------
>used CacheMemoryContext (master) |610296  |2029256 |15909024
>used CacheMemoryContext (patch)  |755176  |880552  |880592
>-----------------------------------------------------------
>TPS (master) |414 |407 |399
>TPS (patch)  |242 |225 |220

Correct one:

number of tx (create and drop) | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |414 |407 |399
TPS (patch)  |447 |415 |409

The results between master and patch are almost the same.

>Another experiment is using Tomas's script posted a while ago. The scenario is do select
>1 from multiple tables randomly (uniform distribution).
>(experiment is repeated 5 times. catalog_cache_max_size = 10MB.
> compared master branch and patch with only hard memory limit enabled)
>
>Before doing the benchmark, I checked pruning happens only at 10000 tables using
>the debug option. The result shows degradation regardless of before or after pruning.
>I personally still need hard size limitation but I'm surprised that the difference is so
>significant.
>
>number of tables | 100   |1000  |10000
>-----------------------------------------------------------
>TPS (master) |10966 |10654 |9099
>TPS (patch)  |4491  |2099  |378

Correct one:

number of tables | 100 |1000 |10000
-----------------------------------------------------------
TPS (master) |10966        |10654       |9099
TPS (patch)  | 11137 (+1%) |10710 (+0%) |772 (-91%)

It seems that before the cache exceeds the limit (no pruning at 100 and 1000), the results are almost the same as master, but after exceeding the limit (at 10000) the decline happens. Regards, Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] > number of tables | 100 |1000 |10000 > ----------------------------------------------------------- > TPS (master) |10966 |10654 |9099 > TPS (patch) | 11137 (+1%) |10710 (+0%) |772 (-91%) > > It seems that before cache exceeding the limit (no pruning at 100 and 1000), > the results are almost same with master but after exceeding the limit (at > 10000) > the decline happens. How many concurrent clients? Can you show the perf's call graph sampling profiles of both the unpatched and patched versions, to confirm that the bottleneck is around catcache eviction and refill? Regards Takayuki Tsunakawa
At Thu, 14 Feb 2019 00:40:10 -0800, Andres Freund <andres@anarazel.de> wrote in <20190214084010.bdn6tmba2j7szo3m@alap3.anarazel.de> > Hi, > > On 2019-02-13 15:31:14 +0900, Kyotaro HORIGUCHI wrote: > > Instead, I added an accounting(?) interface function. > > > > | MemoryContextGetConsumption(MemoryContext cxt); > > > > The API returns the current consumption in this memory > > context. This allows "real" memory accounting almost without > > overhead. > > That's definitely *NOT* almost without overhead. This adds additional > instructions to one of postgres' hottest sets of codepaths. I'm not sure how much the two instructions in AllocSetAlloc actually impact, but I agree that it is doubtful that the size-limit feature is worth the possible slowdown to any extent. # I faintly remember that I tried the same thing before.. > I think you're not working incrementally enough here. I strongly suggest > solving the negative cache entry problem, and then incrementally go from > there after that's committed. The likelihood of this patch ever getting > merged otherwise seems extremely small. Mmm. Scoping to the negcache problem, my very first patch posted two years ago does that based on invalidation for pg_statistic and pg_class, like I think Tom has suggested somewhere in this thread. https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp This is a completely different approach from the current shape and it would be useless after pruning is introduced. So I'd like to go for the generic pruning by age. Difference from v15: Removed AllocSet accounting stuff. We use approximate memory size for catcache. Removed prune-by-number (or size) stuff. Addressing comments from Tsunakawa-san and Ideriha-san. Separated catcache monitoring feature. (Removed from this set) (But it is crucial to check this feature...) Is this small enough? regards.
-- Kyotaro Horiguchi NTT Open Source Software Center From 191496e02abd4d7b261705e8d2a0ef4aed5827c7 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/2] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it. --- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list. + */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From 59f53da08abb70398611b33f635b46bda87a7534 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Tue, 16 Oct 2018 13:04:30 +0900 Subject: [PATCH 2/2] Remove entries that haven't been used for a certain time Catcache entries can be left alone for several reasons. It is not desirable that they eat up memory. With this patch, entries that haven't been used for a certain time are considered for removal before enlarging the hash array. This also can put a hard limit on the number of catcache entries.
--- doc/src/sgml/config.sgml | 40 +++++ src/backend/tcop/postgres.c | 13 ++ src/backend/utils/cache/catcache.c | 243 ++++++++++++++++++++++++-- src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 23 +++ src/backend/utils/misc/postgresql.conf.sample | 2 + src/include/miscadmin.h | 1 + src/include/utils/catcache.h | 43 ++++- src/include/utils/timeout.h | 1 + 10 files changed, 364 insertions(+), 14 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8bd57f376b..7a93aef659 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1661,6 +1661,46 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed. -1 indicates that this feature + is entirely disabled. The value defaults to 300 seconds (<literal>5 + minutes</literal>). Catalog cache entries that are not used for + this duration can be removed to prevent the cache from being filled + up with useless entries. This behaviour is suppressed until the size + of a catalog cache exceeds <xref linkend="guc-catalog-cache-memory-target"/>.
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-catalog-cache-memory-target" xreflabel="catalog_cache_memory_target"> + <term><varname>catalog_cache_memory_target</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_memory_target</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum amount of memory, in kilobytes, to which a + system catalog cache can expand without pruning. The value defaults + to 0, indicating that age-based pruning is always considered. After + exceeding this size, the catalog cache starts pruning according to + <xref linkend="guc-catalog-cache-prune-min-age"/>. If you need to keep + a certain amount of intermittently used catalog cache entries, try + increasing this setting. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 8b4d94c9a1..d9a54ed37f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2584,6 +2585,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly.
*/ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void @@ -3159,6 +3161,14 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + if (CatcacheClockTimeoutPending) + { + CatcacheClockTimeoutPending = false; + + /* Update timestamp then set up the next timeout */ + UpdateCatCacheClock(); + } } @@ -4021,6 +4031,9 @@ PostgresMain(int argc, char *argv[], QueryCancelPending = false; /* second to avoid race condition */ stmt_timeout_active = false; + /* get in sync with the timer state */ + catcache_clock_timeout_active = false; + /* Not reading from the client anymore. */ DoingCommandRead = false; diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 78dd5714fa..30ab710aaa 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -39,6 +39,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -61,9 +62,35 @@ #define CACHE_elog(...) #endif +/* GUC variable to define the minimum age of entries that will be considered to + * be evicted in seconds. This variable is shared among various cache + * mechanisms. + */ +int catalog_cache_prune_min_age = 300; + +/* + * GUC variable to define the minimum size of hash to consider entry eviction. + * This variable is shared among various cache mechanisms. + */ +int catalog_cache_memory_target = 0; + +/* + * Flag to keep track of whether catcache clock timer is active. + */ +bool catcache_clock_timeout_active = false; + +/* + * Minimum interval between two successive moves of a cache entry in the LRU + * list, in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock used to record the last accessed time of a catcache record.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -97,7 +124,7 @@ static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys); -static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, +static size_t CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); @@ -469,6 +496,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -478,6 +506,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_memusage -= ct->size; pfree(ct); --cache->cc_ntup; @@ -811,7 +840,9 @@ InitCatCache(int id, */ sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; cp = (CatCache *) CACHELINEALIGN(palloc0(sz)); - cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head)); + cp->cc_head_size = sz; + sz = nbuckets * sizeof(dlist_head); + cp->cc_bucket = palloc0(sz); /* * initialize the cache's relation information for the relation @@ -830,6 +861,9 @@ InitCatCache(int id, for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + cp->cc_memusage = cp->cc_head_size + sz; + + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some * debugging information, if appropriate. @@ -846,9 +880,143 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * helper routine for SetCatCacheClock and UpdateCatCacheClockTimer. + * + * We need to maintain the catcache clock during a long query. 
+ */ +void +SetupCatCacheClockTimer(void) +{ + long delay; + + /* stop timer if not needed */ + if (catalog_cache_prune_min_age == 0) + { + catcache_clock_timeout_active = false; + return; + } + + /* One 10th of the variable, in milliseconds */ + delay = catalog_cache_prune_min_age * 1000/10; + + /* Lower limit is 1 second */ + if (delay < 1000) + delay = 1000; + + enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay); + + catcache_clock_timeout_active = true; +} + +/* + * Update catcacheclock: this is intended to be called from + * CATCACHE_CLOCK_TIMEOUT. The interval is expected to be more than 1 second + * (see above), so GetCurrentTimestamp() does no harm. + */ +void +UpdateCatCacheClock(void) +{ + catcacheclock = GetCurrentTimestamp(); + SetupCatCacheClockTimer(); +} + +/* + * It may take an unexpectedly long time before the next clock update when + * catalog_cache_prune_min_age gets shorter. Disabling the current timer lets + * the next update happen at the expected interval. We don't necessarily + * require this when increasing the age, but there is no need to avoid + * disabling the timer either. + */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (catcache_clock_timeout_active) + disable_timeout(CATCACHE_CLOCK_TIMEOUT, false); + + catcache_clock_timeout_active = false; +} + +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left alone for several reasons. We remove them if + * they are not accessed for a certain time to prevent catcache from + * bloating. Eviction is performed with an algorithm similar to buffer + * eviction, using an access counter. Entries that are accessed several times + * can live longer than those that have had fewer accesses in the same + * duration.
+ */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + dlist_mutable_iter iter; + + /* Return immediately if no pruning is wanted */ + if (catalog_cache_prune_min_age == 0 || + cp->cc_memusage <= (Size) catalog_cache_memory_target * 1024L) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + long entry_age; + int us; + + /* We don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the last access to the "current" + * time. catcacheclock is updated on a per-statement basis and + * additionally updated periodically during a long-running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* + * No need to look at further entries; exit. At least one + * removal is enough to prevent rehashing this time. + */ + return nremoved > 0; + } + + /* + * Entries that have not been accessed since the last pruning are + * removed within that many seconds, while entries that have been + * accessed several times are left alone for up to three times that + * duration before removal. We don't try to shrink buckets since + * pruning effectively caps catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + /* remove this entry */ + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets.
*/ @@ -858,13 +1026,18 @@ RehashCatCache(CatCache *cp) dlist_head *newbucket; int newnbuckets; int i; + size_t sz; elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); /* Allocate a new, larger, hash table. */ newnbuckets = cp->cc_nbuckets * 2; - newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); + sz = newnbuckets * sizeof(dlist_head); + newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, sz); + + /* reset memory usage */ + cp->cc_memusage = cp->cc_head_size + sz; /* Move all entries from old hash table to new. */ for (i = 0; i < cp->cc_nbuckets; i++) @@ -878,6 +1051,7 @@ RehashCatCache(CatCache *cp) dlist_delete(iter.cur); dlist_push_head(&newbucket[hashIndex], &ct->cache_elem); + cp->cc_memusage += ct->size; } } @@ -1260,6 +1434,21 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Update access information for pruning */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * We don't want to update the LRU too frequently. + * catalog_cache_prune_min_age can be changed within a session, so we + * need to maintain the LRU regardless of catalog_cache_prune_min_age. + */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1695,6 +1884,11 @@ SearchCatCacheList(CatCache *cache, /* Now we can build the CatCList entry. */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); nmembers = list_length(ctlist); + + /* + * Don't waste time counting the list in catcache memory usage, + * since it is not long-lived.
+ */ cl = (CatCList *) palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *)); @@ -1805,6 +1999,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize; /* negative entries have no tuple associated */ if (ntp) @@ -1828,8 +2023,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; @@ -1862,14 +2057,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. */ - CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, - arguments, ct->keys); + tupsize += + CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, + cache->cc_keyno, arguments, ct->keys); MemoryContextSwitchTo(oldcxt); } @@ -1884,19 +2081,33 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + ct->size = tupsize; + cache->cc_memusage += ct->size; + + /* increase refcount so that this survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. 
Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try cleanup by removing + * infrequently used entries to make room for the new entry. If that + * fails, enlarge the bucket array instead. Quite arbitrarily, we try + * this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } @@ -1926,13 +2137,14 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys) /* * Helper routine that copies the keys in the srckeys array into the dstkeys * one, guaranteeing that the datums are fully allocated in the current memory - * context. + * context. Returns allocated memory size. */ -static void +static size_t CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys) { int i; + size_t sz = 0; /* * XXX: memory and lookup performance could possibly be improved by @@ -1961,8 +2173,13 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, dstkeys[i] = datumCopy(src, att->attbyval, att->attlen); + + /* approximate size */ + if (!att->attbyval) + sz += VARHDRSZ + att->attlen; } + return sz; } /* diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..0e8b972a29 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t CatcacheClockTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index
a5ee209f91..9eb50e9676 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void CatcacheClockTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, + CatcacheClockTimeoutHandler); } /* @@ -1238,6 +1241,14 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +static void +CatcacheClockTimeoutHandler(void) +{ + CatcacheClockTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. 
*/ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 156d147c85..d863c8dec8 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2205,6 +2206,28 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum unused duration of cache entries before removal."), + gettext_noop("Catalog cache entries that live unused for longer than this many seconds become candidates for removal."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + + { + {"catalog_cache_memory_target", PGC_USERSET, RESOURCES_MEM, + gettext_noop("Sets the minimum syscache size to keep."), + gettext_noop("Time-based cache pruning starts working after exceeding this size."), + GUC_UNIT_KB + }, + &catalog_cache_memory_target, + 0, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 194f312096..7c82b0eca7 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,8 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_memory_target = 0kB # in kB +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..33b800e80f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..1ae49b4819 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,10 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; + int cc_head_size; /* memory usage of catcache header */ + int cc_memusage; /* total memory usage of this catcache */ + int cc_nfreeent; /* # of entries currently not 
referenced */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,7 +124,10 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ - + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* approx. timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size of this tuple */ /* * The tuple may also be a member of at most one CatCList. (If a single * catcache is list-searched with varying numbers of keys, we may have to @@ -189,6 +197,39 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; +extern int catalog_cache_memory_target; +extern int catalog_cache_entry_limit; +extern double catalog_cache_prune_ratio; + +/* to use as access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* + * Flag to keep track of whether catcache timestamp timer is active. + */ +extern bool catcache_clock_timeout_active; + +/* catcache prune time helper functions */ +extern void SetupCatCacheClockTimer(void); +extern void UpdateCatCacheClock(void); + +/* + * SetCatCacheClock - set timestamp for catcache access record and start + * maintenance timer if needed. We keep updating the clock even while pruning + * is disabled so that we are not confused by a bogus clock value.
+ */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; + + if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0) + SetupCatCacheClockTimer(); +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..b2d97b4f7b 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + CATCACHE_CLOCK_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3
On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Difference from v15: > > Removed AllocSet accounting stuff. We use approximate memory > size for catcache. > > Removed prune-by-number(or size) stuff. > > Adressing comments from Tsunakawa-san and Ideriha-san . > > Separated catcache monitoring feature. (Removed from this set) > (But it is crucial to check this feature...) > > Is this small enough ? The commit message in 0002 says 'This also can put a hard limit on the number of catcache entries.' but neither of the GUCs that you've documented have that effect. Is that a leftover from a previous version? I'd like to see some evidence that catalog_cache_memory_target has any value, vs. just always setting it to zero. I came up with the following somewhat artificial example that shows that it might have value. rhaas=# create table foo (a int primary key, b text) partition by hash (a); [rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }' | psql First execution of 'select * from foo' in a brand new session takes about 1.9 seconds; subsequent executions take about 0.7 seconds. So, if catalog_cache_memory_target were set to a high enough value to allow all of that stuff to remain in cache, we could possibly save about 1.2 seconds coming off the blocks after a long idle period. That might be enough to justify having the parameter. But I'm not quite sure how high the value would need to be set to actually get the benefit in a case like that, or what happens if you set it to a value that's not quite high enough. I think it might be good to play around some more with cases like this, just to get a feeling for how much time you can save in exchange for how much memory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>From: Tsunakawa, Takayuki
>>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>> number of tables | 100         | 1000        | 10000
>> ---------------------------------------------------------
>> TPS (master)     | 10966       | 10654       | 9099
>> TPS (patch)      | 11137 (+1%) | 10710 (+0%) | 772 (-91%)
>>
>> It seems that before the cache exceeds the limit (no pruning at 100 and
>> 1000), the results are almost the same as master, but after exceeding the
>> limit (at 10000) the decline happens.
>
>How many concurrent clients?

One client (default setting).

>Can you show the perf's call graph sampling profiles of both the unpatched and
>patched version, to confirm that the bottleneck is around catcache eviction and refill?

I checked it with perf record -avg and perf report. The following shows the top 20 symbols during the benchmark, including kernel space. The main difference between master (unpatched) and the patched one seems to be that the patched one spends CPU in catcache-evict-and-refill functions, including SearchCatCacheMiss(), CatalogCacheCreateEntry(), and CatCacheCleanupOldEntries().
So it seems to me that these functions need further inspection to suppress the performance decline as much as possible.

master (%)                       | patched (%)
51.25% cpu_startup_entry         | 51.45% cpu_startup_entry
51.13% arch_cpu_idle             | 51.19% arch_cpu_idle
51.13% default_idle              | 51.19% default_idle
51.13% native_safe_halt          | 50.95% native_safe_halt
36.27% PostmasterMain            | 46.98% PostmasterMain
36.27% main                      | 46.98% main
36.27% __libc_start_main         | 46.98% __libc_start_main
36.07% ServerLoop                | 46.93% ServerLoop
35.75% PostgresMain              | 46.89% PostgresMain
26.03% exec_simple_query         | 45.99% exec_simple_query
26.00% rest_init                 | 43.40% SearchCatCacheMiss
26.00% start_kernel              | 42.80% CatalogCacheCreateEntry
26.00% x86_64_start_reservations | 42.75% CatCacheCleanupOldEntries
26.00% x86_64_start_kernel       | 27.04% rest_init
25.26% start_secondary           | 27.04% start_kernel
10.25% pg_plan_queries           | 27.04% x86_64_start_reservations
10.17% pg_plan_query             | 27.04% x86_64_start_kernel
10.16% main                      | 24.42% start_secondary
10.16% __libc_start_main         | 22.35% pg_analyze_and_rewrite
10.03% standard_planner          | 22.35% parse_analyze

Regards,
Takeshi Ideriha
From: Ideriha, Takeshi/出利葉 健
> I checked it with perf record -avg and perf report.
> The following shows the top 20 symbols during the benchmark, including kernel space.
> The main difference between master (unpatched) and the patched one seems to be that
> the patched one spends CPU in catcache-evict-and-refill functions, including
> SearchCatCacheMiss(), CatalogCacheCreateEntry(),
> CatCacheCleanupOldEntries().
> So it seems to me that these functions need further inspection
> to suppress the performance decline as much as possible

Thank you. It's good to see the expected functions, rather than strange behavior. The performance drop is natural, just like when the database cache's hit ratio is low. The remedy for performance by the user is also the same as for the database cache -- increase the catalog cache.

Regards
Takayuki Tsunakawa
From: Robert Haas [mailto:robertmhaas@gmail.com]
> That might be enough to justify having the parameter. But I'm not
> quite sure how high the value would need to be set to actually get the
> benefit in a case like that, or what happens if you set it to a value
> that's not quite high enough. I think it might be good to play around
> some more with cases like this, just to get a feeling for how much
> time you can save in exchange for how much memory.

Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is,

* If you want to avoid infinite memory bloat, set the upper limit on size.

* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway)

Why do people try to get away from a familiar idea... Am I missing something?

Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or not? That is:

* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on feedback

Regards
Takayuki Tsunakawa
On Tue, Feb 19, 2019 at 07:08:14AM +0000, Tsunakawa, Takayuki wrote: > We all have to manage things within resource constraints. The DBA > wants to make sure the server doesn't overuse memory to avoid crash > or slowdown due to swapping. Oracle does it, and another open source > database, MySQL, does it too. PostgreSQL does it with shared_buffers, > wal_buffers, and work_mem (within a single session). Then, I thought > it's natural to do it with catcache/relcache/plancache. I already addressed these questions in an email from Feb 14: https://www.postgresql.org/message-id/20190214154955.GB19578@momjian.us I understand the operational needs of limiting resources in some cases, but there is also the history of OS's using working set to allocate things, which didn't work too well: https://en.wikipedia.org/wiki/Working_set I think we need to address the most pressing problem of unlimited cache size bloat and then take a holistic look at all memory allocation. If we are going to address that in a global way, I don't see the relation cache as the place to start. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
>From: Tsunakawa, Takayuki
>Ideriha-san,
>Could you try simplifying the v15 patch set to see how simple the code would look or
>not? That is:
>
>* 0001: add dlist_push_tail() ... as is
>* 0002: memory accounting, with correction based on feedback
>* 0003: merge the original 0003 and 0005, with correction based on feedback

Attached is a simpler version based on Horiguchi-san's v15 patch, which means the cache is pruned by both time and size. (The cleanup function is still complex, but it gets much simpler.)

Regards,
Takeshi Ideriha
Attachment
On Thu, Feb 21, 2019 at 1:38 AM Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is,
>
> * If you want to avoid infinite memory bloat, set the upper limit on size.
>
> * To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway)
>
> Why do people try to get away from a familiar idea... Am I missing something?

I don't understand the idea that we would add something to PostgreSQL without proving that it has value. Sure, other systems have somewhat similar systems, and they have knobs to tune them. But, first, we don't know that those other systems made all the right decisions, and second, even if they are, that doesn't mean that we'll derive similar benefits in a system with a completely different code base and many other internal differences. You need to demonstrate that each and every GUC you propose to add has a real, measurable benefit in some plausible scenario. You can't just argue that other people have something kinda like this so we should have it too. Or, well, you can argue that, but if you do, then -1 from me.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
At Wed, 20 Feb 2019 13:09:08 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZXw+SwK_9Tp=wLqZDstW_X+Ant=rd7K+q4zmYONPuL=w@mail.gmail.com>
> On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Difference from v15:
> >
> > Removed AllocSet accounting stuff. We use approximate memory
> > size for catcache.
> >
> > Removed prune-by-number (or size) stuff.
> >
> > Addressing comments from Tsunakawa-san and Ideriha-san.
> >
> > Separated catcache monitoring feature. (Removed from this set)
> > (But it is crucial to check this feature...)
> >
> > Is this small enough?
>
> The commit message in 0002 says 'This also can put a hard limit on the
> number of catcache entries.' but neither of the GUCs that you've
> documented have that effect. Is that a leftover from a previous
> version?

Mmm. Right. Thank you for pointing that out, and sorry about that. I fixed it, including another mistake in the commit message, in my repo. It will appear in the next version.

| Remove entries that haven't been used for a certain time
|
| Catcache entries can be left alone for several reasons. It is not
| desirable that they eat up memory. With this patch, entries that
| haven't been used for a certain time are considered to be removed
| before enlarging hash array.

> I'd like to see some evidence that catalog_cache_memory_target has any
> value, vs. just always setting it to zero. I came up with the
> following somewhat artificial example that shows that it might have
> value.
>
> rhaas=# create table foo (a int primary key, b text) partition by hash (a);
> [rhaas pgsql]$ perl -e 'for (0..9999) { print "CREATE TABLE foo$_
> PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; }'
> | psql
>
> First execution of 'select * from foo' in a brand new session takes
> about 1.9 seconds; subsequent executions take about 0.7 seconds.
So,
> if catalog_cache_memory_target were set to a high enough value to
> allow all of that stuff to remain in cache, we could possibly save
> about 1.2 seconds coming off the blocks after a long idle period.
> That might be enough to justify having the parameter. But I'm not
> quite sure how high the value would need to be set to actually get the
> benefit in a case like that, or what happens if you set it to a value
> that's not quite high enough.

It is artificial (or actually won't be repeatedly executed in a session), but anyway what can get a benefit from catalog_cache_memory_target would be a kind of extreme.

I think the two parameters are to be tuned in the following steps.

- If the default setting satisfies you, leave it alone. (as a general suggestion)

- If you find your (syscache-sensitive) queries are executed at rather long intervals, say 10-30 minutes, and they get slower than at shorter intervals, consider increasing catalog_cache_prune_min_age to about the query interval. If you don't suffer process bloat, that's fine.

- If you find the process bloats too much and you (intuitively) suspect the cause is the system cache, set it to a certain shorter value, say 1 minute, and set catalog_cache_memory_target to the allowable amount of memory for each process. The memory usage will be stable at an (un)certain amount above the target.

Or, if you want to determine the setting beforehand with a rather strict limit, and if the monitoring feature were a part of this patchset, a user can check how much memory is used for the query.

$ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql

sum
---------
7088523

In this case, set catalog_cache_memory_target to 7MB and catalog_cache_prune_min_age to '1min'.
Since the target doesn't work strictly (it is checked only at every resize), you may possibly need further tuning.

> that's not quite high enough. I think it might be good to play around
> some more with cases like this, just to get a feeling for how much
> time you can save in exchange for how much memory.

All kinds of tuning are something of that sort, I think.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
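The mechanism under discussion — dropping catcache entries whose last access is older than catalog_cache_prune_min_age, checked only when the hash array would otherwise be enlarged — can be sketched in isolation. This is a hedged illustration, not the patch's code: `Cache`, `CacheEnt`, `cache_add`, and `cleanup_old_entries` are invented names, and a plain `long` stands in for the TimestampTz-based catcache clock.

```c
#include <stdlib.h>

/* Invented stand-in for a catcache entry, chained in access (LRU) order. */
typedef struct CacheEnt
{
	struct CacheEnt *next;
	long		lastaccess;		/* stand-in for the catcache clock value */
} CacheEnt;

typedef struct
{
	CacheEnt   *lru_head;		/* oldest entry first */
	int			nentries;
	long		prune_min_age;	/* stand-in for catalog_cache_prune_min_age */
} Cache;

/* Append a freshly accessed entry at the tail (youngest end) of the list. */
static CacheEnt *
cache_add(Cache *cache, long now)
{
	CacheEnt   *ent = malloc(sizeof(CacheEnt));
	CacheEnt  **link = &cache->lru_head;

	ent->next = NULL;
	ent->lastaccess = now;
	while (*link)
		link = &(*link)->next;
	*link = ent;
	cache->nentries++;
	return ent;
}

/*
 * Called where the patch would otherwise enlarge the hash array: remove
 * entries not accessed for prune_min_age.  Returns the number removed.
 */
static int
cleanup_old_entries(Cache *cache, long now)
{
	int			removed = 0;
	CacheEnt  **link = &cache->lru_head;

	while (*link)
	{
		CacheEnt   *ent = *link;

		if (now - ent->lastaccess > cache->prune_min_age)
		{
			*link = ent->next;
			free(ent);
			cache->nentries--;
			removed++;
		}
		else
			break;				/* LRU order: all remaining entries are younger */
	}
	return removed;
}
```

Keeping the list in access order is what lets the scan stop at the first young entry; the patch gets the same property by moving an entry to the list tail (dlist_move_tail) on every cache hit.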
At Mon, 25 Feb 2019 15:23:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190225.152322.104148315.horiguchi.kyotaro@lab.ntt.co.jp>
> I think the two parameters are to be tuned in the following
> steps.
>
> - If the default setting satisfies you, leave it alone. (as a
>   general suggestion)
>
> - If you find your (syscache-sensitive) queries are executed at
>   rather long intervals, say 10-30 minutes, and they get slower
>   than at shorter intervals, consider increasing
>   catalog_cache_prune_min_age to about the query interval. If you
>   don't suffer process bloat, that's fine.
>
> - If you find the process bloats too much and you (intuitively)
>   suspect the cause is the system cache, set it to a certain
>   shorter value, say 1 minute, and set catalog_cache_memory_target
>   to the allowable amount of memory for each process. The memory
>   usage will be stable at an (un)certain amount above the target.
>
> Or, if you want to determine the setting beforehand with a rather
> strict limit, and if the monitoring feature were a part of this
> patchset, a user can check how much memory is used for the query.
>
> $ perl -e 'print "set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql
>
> sum
> ---------
> 7088523

It's not substantial, but the number above is for catalog_cache_prune_min_age = 300s; I had 12MB when it is disabled.

perl -e 'print "set catalog_cache_prune_min_age to 0; set track_catalog_cache_usage_interval = 1000;\n"; for (0..9999) { print "CREATE TABLE foo$_ PARTITION OF foo FOR VALUES WITH (MODULUS 10000, REMAINDER $_);\n"; } print "select sum(size) from pg_stat_syscache";' | psql

sum
----------
12642321

> In this case, set catalog_cache_memory_target to 7MB and
> catalog_cache_prune_min_age to '1min'.
> Since the target doesn't work strictly (checked only at every resizing
> time), possibly you need further tuning.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
From: Robert Haas [mailto:robertmhaas@gmail.com]
> I don't understand the idea that we would add something to PostgreSQL
> without proving that it has value. Sure, other systems have somewhat
> similar systems, and they have knobs to tune them. But, first, we
> don't know that those other systems made all the right decisions, and
> second, even if they are, that doesn't mean that we'll derive similar
> benefits in a system with a completely different code base and many
> other internal differences.

I understand that general idea. So, I don't have an idea why the proposed approach -- eviction based only on elapsed time, only at hash table expansion -- is better for PostgreSQL's code base and other internal differences...

> You need to demonstrate that each and every GUC you propose to add has
> a real, measurable benefit in some plausible scenario. You can't just
> argue that other people have something kinda like this so we should
> have it too. Or, well, you can argue that, but if you do, then -1
> from me.

The benefits of the size limit are:

* Controllable and predictable memory usage. The DBA can be sure that OOM won't happen.
* Smoothed (non-abnormal) transaction response time, due to the elimination of bulk eviction of cache entries.

I'm not sure how to tune catalog_cache_prune_min_age and catalog_cache_memory_target. Let me pick up a test scenario in a later mail in response to Horiguchi-san.

Regards
Takayuki Tsunakawa
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> - If you find the process bloats too much and you (intuitively)
>   suspect the cause is the system cache, set it to a certain
>   shorter value, say 1 minute, and set catalog_cache_memory_target
>   to the allowable amount of memory for each process. The memory
>   usage will be stable at an (un)certain amount above the target.

Could you guide me how to tune these parameters in an example scenario? Let me take the original problematic case referenced at the beginning of this thread. That is:

* A PL/pgSQL function that creates a temp table, accesses it, (accesses other non-temp tables), and drops the temp table.
* An application repeatedly begins a transaction, calls the stored function, and commits the transaction.

With the v16 patch applied, and leaving the catalog_cache_xxx parameters set to their defaults, CacheMemoryContext continued to increase as follows:

CacheMemoryContext: 1065016 total in 9 blocks; 104168 free (17 chunks); 960848 used
CacheMemoryContext: 8519736 total in 12 blocks; 3765504 free (19 chunks); 4754232 used
CacheMemoryContext: 25690168 total in 14 blocks; 8372096 free (21 chunks); 17318072 used
CacheMemoryContext: 42991672 total in 16 blocks; 11741024 free (21761 chunks); 31250648 used

How can I make sure that this context won't exceed, say, 10 MB to avoid OOM? I'm afraid that once the catcache hash table becomes large in a short period, eviction would happen less frequently, leading to memory bloat.

Regards
Takayuki Tsunakawa
>>From: Tsunakawa, Takayuki
>>Ideriha-san,
>>Could you try simplifying the v15 patch set to see how simple the code
>>would look or not? That is:
>>
>>* 0001: add dlist_push_tail() ... as is
>>* 0002: memory accounting, with correction based on feedback
>>* 0003: merge the original 0003 and 0005, with correction based on
>>feedback
>
>Attached is a simpler version based on Horiguchi-san's v15 patch, which means
>the cache is pruned by both time and size.
>(The cleanup function is still complex, but it gets much simpler.)

I don't mean to disregard what Horiguchi-san and others have developed and discussed. But I refactored the v15 patch again to reduce its complexity, because it seems to me that one of the reasons for dropping the prune-by-size feature stems from code complexity.

Another thing is that the memory accounting overhead has been discussed, but its effect hasn't been measured in this thread. So I'd like to measure it.

Regards,
Takeshi Ideriha
Attachment
On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > How can I make sure that this context won't exceed, say, 10 MB to avoid OOM? As Tom has said before and will probably say again, I don't think you actually want that. We know that PostgreSQL gets roughly 100x slower with the system caches disabled - try running with CLOBBER_CACHE_ALWAYS. If you are accessing the same system cache entries repeatedly in a loop - which is not at all an unlikely scenario, just run the same query or sequence of queries in a loop - and if the number of entries exceeds 10MB even, perhaps especially, by just a tiny bit, you are going to see a massive performance hit. Maybe it won't be 100x because some more-commonly-used entries will always stay cached, but it's going to be really big, I think. Now you could say - well it's still better than running out of memory. However, memory usage is quite unpredictable. It depends on how many backends are active and how many copies of work_mem and/or maintenance_work_mem are in use, among other things. I don't think we can say that just imposing a limit on the size of the system caches is going to be enough to reliably prevent an out of memory condition unless the other use of memory on the machine happens to be extremely stable. So I think what's going to happen if you try to impose a hard-limit on the size of the system cache is that you will cause some workloads to slow down by 3x or more without actually preventing out of memory conditions. What you need to do is accept that system caches need to grow as big as they need to grow, and if that causes you to run out of memory, either buy more memory or reduce the number of concurrent sessions you allow. It would be fine to instead limit the cache memory if those cache entries only had a mild effect on performance, but I don't think that's the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > I'd like to see some evidence that catalog_cache_memory_target has any > > value, vs. just always setting it to zero. > > It is artificial (or acutually wont't be repeatedly executed in a > session) but anyway what can get benefit from > catalog_cache_memory_target would be a kind of extreme. I agree. So then let's not have it. We shouldn't add more mechanism here than actually has value. It seems pretty clear that keeping cache entries that go unused for long periods can't be that important; even if we need them again eventually, reloading them every 5 or 10 minutes can't hurt that much. On the other hand, I think it's also pretty clear that evicting cache entries that are being used frequently will have disastrous effects on performance; as I noted in the other email I just sent, consider the effects of CLOBBER_CACHE_ALWAYS. No reasonable user is going to want to incur a massive slowdown to save a little bit of memory. I see that *in theory* there is a value to catalog_cache_memory_target, because *maybe* there is a workload where tuning that GUC will lead to better performance at lower memory usage than any competing proposal. But unless we can actually see an example of such a workload, which so far I don't, we're adding a knob that everybody has to think about how to tune when in fact we have no idea how to tune it or whether it even needs to be tuned. That doesn't make sense. We have to be able to document the parameters we have and explain to users how they should be used. And as far as this parameter is concerned I think we are not at that point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>>>* 0001: add dlist_push_tail() ... as is
>>>* 0002: memory accounting, with correction based on feedback
>>>* 0003: merge the original 0003 and 0005, with correction based on
>>>feedback
>>
>>Attached is a simpler version based on Horiguchi-san's v15 patch,
>>which means the cache is pruned by both time and size.
>>(The cleanup function is still complex, but it gets much simpler.)
>
>I don't mean to disregard what Horiguchi-san and others have developed and
>discussed.
>But I refactored the v15 patch again to reduce its complexity, because it
>seems to me that one of the reasons for dropping the prune-by-size feature
>stems from code complexity.
>
>Another thing is that the memory accounting overhead has been discussed, but
>its effect hasn't been measured in this thread. So I'd like to measure it.

I measured the memory context accounting overhead using Tomas's tool palloc_bench, which he made a while ago in a similar discussion.
https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz

This tool is a little bit outdated so I fixed it, but basically I followed him. Things I did:
- make one MemoryContext
- run both palloc() and pfree() for a 32kB area 1,000,000 times
- measure this time

The result shows that master is 30 times faster than the patched one. So, as Andres mentioned upthread, it seems to have overhead.

[master (without v15 patch)]
61.52 ms
60.96 ms
61.40 ms
61.42 ms
61.14 ms

[with v15 patch]
1838.02 ms
1754.84 ms
1755.83 ms
1789.69 ms
1789.44 ms

Regards,
Takeshi Ideriha
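For reference, the kind of loop palloc_bench times can be sketched with plain malloc/free. This is a hedged stand-in, not the real tool (which drives PostgreSQL's palloc/pfree on a MemoryContext); `bench_alloc_free` is an invented name, and numbers it produces are not comparable to the ones above.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/*
 * Time an alloc/free loop: allocate and release `size` bytes `iters` times,
 * returning elapsed wall-clock milliseconds.
 */
static double
bench_alloc_free(size_t size, long iters)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
	{
		char	   *p = malloc(size);

		p[0] = (char) i;	/* touch the block so the pair isn't optimized away */
		free(p);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) * 1000.0 +
		   (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
}
```

Comparing the loop with and without an accounting hook isolates the per-call accounting cost, which is the quantity being debated in this exchange.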
>From: Robert Haas [mailto:robertmhaas@gmail.com] > >On Mon, Feb 25, 2019 at 3:50 AM Tsunakawa, Takayuki ><tsunakawa.takay@jp.fujitsu.com> wrote: >> How can I make sure that this context won't exceed, say, 10 MB to avoid OOM? > >As Tom has said before and will probably say again, I don't think you actually want that. >We know that PostgreSQL gets roughly 100x slower with the system caches disabled >- try running with CLOBBER_CACHE_ALWAYS. If you are accessing the same system >cache entries repeatedly in a loop - which is not at all an unlikely scenario, just run the >same query or sequence of queries in a loop - and if the number of entries exceeds >10MB even, perhaps especially, by just a tiny bit, you are going to see a massive >performance hit. >Maybe it won't be 100x because some more-commonly-used entries will always stay >cached, but it's going to be really big, I think. > >Now you could say - well it's still better than running out of memory. >However, memory usage is quite unpredictable. It depends on how many backends >are active and how many copies of work_mem and/or maintenance_work_mem are in >use, among other things. I don't think we can say that just imposing a limit on the >size of the system caches is going to be enough to reliably prevent an out of memory >condition unless the other use of memory on the machine happens to be extremely >stable. >So I think what's going to happen if you try to impose a hard-limit on the size of the >system cache is that you will cause some workloads to slow down by 3x or more >without actually preventing out of memory conditions. What you need to do is accept >that system caches need to grow as big as they need to grow, and if that causes you >to run out of memory, either buy more memory or reduce the number of concurrent >sessions you allow. It would be fine to instead limit the cache memory if those cache >entries only had a mild effect on performance, but I don't think that's the case. 
I'm afraid I may be quibbling about it. What about users who understand the performance drop but don't want to add memory or decrease concurrency? I think PostgreSQL has parameters that most users don't mind and use at their defaults, but that a few users want to change. In this case, as you said, introducing a hard-limit parameter causes a significant performance decrease, so how about adding a detailed caution to the documentation, like for the planner cost parameters?

Regards,
Takeshi Ideriha
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench,
> which he made a while ago in a similar discussion.
> https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz
>
> This tool is a little bit outdated so I fixed it but basically I followed
> him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for 32kB area 1,000,000 times.
> - And measure this time
>
> The result shows that master is 30 times faster than patched one.
> So as Andres mentioned upthread it seems it has overhead.
>
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
>
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms

I'm afraid the measurement is not correct. First, the older discussion below shows that the accounting overhead is much, much smaller, even with a more complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess the accounting overhead will be more likely to be hidden under the overhead of malloc() and free(). What we'd like to know is the overhead when malloc() and free() are not called.

And are you sure you didn't enable assert checking?

Regards
Takayuki Tsunakawa
>From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>> I measured the memory context accounting overhead using Tomas's tool
>> palloc_bench, which he made a while ago in a similar discussion.
>> https://www.postgresql.org/message-id/53F7E83C.3020304@fuzzy.cz
>>
>> This tool is a little bit outdated so I fixed it but basically I
>> followed him.
>> Things I did:
>> - make one MemoryContext
>> - run both palloc() and pfree() for 32kB area 1,000,000 times.
>> - And measure this time

>And are you sure you didn't enable assert checking?

Ah, sorry.. I misconfigured it.

>I'm afraid the measurement is not correct. First, the older discussion below shows
>that the accounting overhead is much, much smaller, even with a more complex
>accounting.
>Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess the
>accounting overhead will be more likely to be hidden under the overhead of malloc()
>and free(). What we'd like to know is the overhead when malloc() and free() are not
>called.

Here is the average of 50 measurements: palloc/pfree for 800 bytes 1,000,000 times, and for 32kB 1,000,000 times. I checked with gdb that malloc is not called at size=800.

[Size=800, iter=1,000,000]
Master |15.763
Patched|16.262 (+3%)

[Size=32768, iter=1,000,000]
Master |61.3076
Patched|62.9566 (+2%)

At least compared to the previous HashAgg version, the overhead is smaller. It has some overhead, but the increase is only 2 or 3%.

Regards,
Takeshi Ideriha
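The accounting whose cost is being measured here is, at its core, a running counter bumped on every allocation and release. A minimal stand-alone sketch of that idea (invented names; not the patch's MemoryContext code, which hangs the counter off MemoryContextData):

```c
#include <stdlib.h>

/* Invented context header carrying a running total of live allocations. */
typedef struct
{
	size_t		usedspace;
} Ctx;

/* Each chunk remembers its size so the free side can subtract it. */
typedef struct
{
	size_t		size;
} ChunkHdr;

static void *
ctx_alloc(Ctx *ctx, size_t size)
{
	ChunkHdr   *hdr = malloc(sizeof(ChunkHdr) + size);

	hdr->size = size;
	ctx->usedspace += size;		/* the whole accounting cost: one addition */
	return hdr + 1;
}

static void
ctx_free(Ctx *ctx, void *ptr)
{
	ChunkHdr   *hdr = (ChunkHdr *) ptr - 1;

	ctx->usedspace -= hdr->size;	/* ...and one subtraction on release */
	free(hdr);
}
```

The per-call cost is one add or subtract plus a header read, which is consistent with the small (2-3%) overhead measured above when malloc() itself is not involved.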
On Wed, Feb 27, 2019 at 3:16 AM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote: > I'm afraid I may be quibbling about it. > What about users who understand performance drops but don't want to > add memory or decrease concurrency? > I think that PostgreSQL has a parameter > which most of users don't mind and use is as default > but a few of users want to change it. > In this case as you said, introducing hard limit parameter causes > performance decrease significantly so how about adding detailed caution > to the document like planner cost parameter? There's nothing wrong with a parameter that is useful to some people and harmless to everyone else, but the people who are proposing that parameter still have to demonstrate that it has those properties. This email thread is really short on clear demonstrations that X or Y is useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
>
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, second or millisecond? And why does the number of digits to the right of the decimal point differ?

Is the measurement correct? I'm wondering because the difference is larger in the latter case. Isn't the accounting processing almost the same in both cases?

* former: 16.262 - 15.763 = 0.499
* latter: 62.956 - 61.307 = 1.649

> At least compared to the previous HashAgg version, the overhead is smaller.
> It has some overhead, but the increase is only 2 or 3%.

I think the overhead is sufficiently small. It may get even smaller with a trivial tweak.

You added the new member usedspace at the end of MemoryContextData. The original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes. So, the new member will be on a separate cache line. Try putting usedspace before the name member.

Regards
Takayuki Tsunakawa
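Tsunakawa-san's struct-layout suggestion can be checked mechanically with offsetof: a member placed at offset 72 on a machine with 64-byte cache lines lands on the second line, while moving it before the 64-byte boundary keeps it on the first, alongside the hot members. The structs below are simplified, invented stand-ins for the two layouts discussed, not the real MemoryContextData.

```c
#include <stddef.h>

#define CACHE_LINE 64

/* Appending the counter after 72 bytes of existing members (the layout
 * being critiqued): it falls on the second cache line. */
typedef struct
{
	char		other[72];		/* pretend existing members */
	size_t		usedspace;		/* appended: offset 72, second cache line */
} TailLayout;

/* Moving the counter before the boundary (the layout being suggested):
 * it shares the first cache line with the frequently touched members. */
typedef struct
{
	char		hot[56];		/* pretend frequently-touched members */
	size_t		usedspace;		/* offset 56: still on the first cache line */
	char		cold[16];		/* pretend the name pointer and friends */
} EarlyLayout;
```

Dividing the member's offset by the line size gives the cache line it starts on, which is the whole argument in one arithmetic step.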
Robert> This email thread is really short on clear demonstrations that X or Y
Robert> is useful.

It is useful when the whole database does **not** crash, isn't it?

Case A (== current PostgreSQL mode): syscache grows, then the OOM killer chimes in, kills the database process, and it leads to a complete cluster failure (all other PG processes terminate themselves).

Case B (== limit syscache by 10MiB or whatever, as Tsunakawa, Takayuki asks): a single ill-behaved process works a bit slower and/or consumes more CPU than the other ones. The whole DB is still alive.

I'm quite sure "case B" is much better for the end users and for the database administrators. So, +1 to Tsunakawa, Takayuki; it would be so great if there was a way to limit the memory consumption of a single process (e.g. syscache, work_mem, etc).

Robert> However, memory usage is quite unpredictable. It depends on how many
Robert> backends are active

The number of backends can be limited by ensuring proper limits at the application connection pool level and/or pgbouncer and/or things like that.

Robert> how many copies of work_mem and/or
Robert> maintenance_work_mem are in use

There might be other patches to cap the total use of work_mem/maintenance_work_mem.

Robert> I don't think we
Robert> can say that just imposing a limit on the size of the system caches is
Robert> going to be enough to reliably prevent an out of memory condition

The fewer possibilities there are for OOM, the better. Quite often it is much better to fail a single SQL statement than to kill all the DB processes.

Vladimir
At Tue, 26 Feb 2019 10:55:18 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa2b-LUF9h3wugD9ZA5MP0xyu2kJYHC9L6sdLywNSmhBQ@mail.gmail.com>
> On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I'd like to see some evidence that catalog_cache_memory_target has any
> > > value, vs. just always setting it to zero.
> >
> > It is artificial (or actually won't be repeatedly executed in a
> > session) but anyway what can get a benefit from
> > catalog_cache_memory_target would be a kind of extreme.
>
> I agree. So then let's not have it.

Ah... Yeah! I see. Andres' concern was that crucial syscache entries might be blown away during a long idle time. If that happens, it's enough to just turn the feature off in almost all such cases. We no longer need to count memory usage without the feature. That stuff has moved to the monitoring feature, which is out of the scope of the current status of this patch.

> We shouldn't add more mechanism here than actually has value. It
> seems pretty clear that keeping cache entries that go unused for long
> periods can't be that important; even if we need them again
> eventually, reloading them every 5 or 10 minutes can't hurt that much.
> On the other hand, I think it's also pretty clear that evicting cache
> entries that are being used frequently will have disastrous effects on
> performance; as I noted in the other email I just sent, consider the
> effects of CLOBBER_CACHE_ALWAYS. No reasonable user is going to want
> to incur a massive slowdown to save a little bit of memory.
>
> I see that *in theory* there is a value to
> catalog_cache_memory_target, because *maybe* there is a workload where
> tuning that GUC will lead to better performance at lower memory usage
> than any competing proposal.
But unless we can actually see an > example of such a workload, which so far I don't, we're adding a knob > that everybody has to think about how to tune when in fact we have no > idea how to tune it or whether it even needs to be tuned. That > doesn't make sense. We have to be able to document the parameters we > have and explain to users how they should be used. And as far as this > parameter is concerned I think we are not at that point. In the attached v18, catalog_cache_memory_target is removed, some leftovers of the removed hard-limit feature are cleaned up, and the catcache clock update during a query is separated out into 0003. 0004 (the monitoring part) is attached just to show how it is working. v18-0001-Add-dlist_move_tail: Just adds dlist_move_tail v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-: Revised pruning feature. ==== v18-0003-Asynchronous-update-of-catcache-clock: Separated catcache clock update feature. v18-0004-Syscache-usage-tracking-feature: Usage tracking feature. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 54388a7452eda1faadaa108e1bc21d51844f9224 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/6] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it. --- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list.
+ */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From c79d5fc86f45e6545cbc257040e46125ffc5cb92 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH 2/6] Remove entries that haven't been used for a certain time Catcache entries can be left unused for long periods for several reasons, and it is not desirable that such useless entries eat up memory. The catcache pruning feature removes entries that haven't been accessed for a certain time before enlarging the hash array. --- doc/src/sgml/config.sgml | 19 ++++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 122 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 18 ++++ 6 files changed, 171 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6d42b7afe7..737a156bb4 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1661,6 +1661,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of time in seconds that a system catalog + cache entry must go unused before it can be removed. -1 disables the + feature entirely. The value defaults to 300 seconds (<literal>5 + minutes</literal>).
Entries that go unused for that duration + can be removed to prevent the catalog cache from bloating with useless + entries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 8b4d94c9a1..02b9ef98aa 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2584,6 +2585,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 78dd5714fa..4386957497 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -39,6 +39,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -61,9 +62,24 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age, in seconds, for entries to be + * considered for eviction. -1 disables the feature. + */ +int catalog_cache_prune_min_age = 300; + +/* + * Minimum interval between two successive moves of a cache entry in LRU list, + * in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; /* Clock for the last accessed time of a catcache entry.
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -469,6 +485,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -829,6 +846,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some @@ -846,9 +864,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if it hasn't been done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left unused for a long time for several reasons. + * Remove such entries to prevent the catcache from bloating. It is based + * on an algorithm similar to buffer eviction: entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + dlist_mutable_iter iter; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age < 0) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. catcacheclock is updated per-statement + * basis.
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* + * The remaining entries are all newer than this, so exit. At + * least one removal prevents rehashing this time. + */ + break; + } + + /* + * Entries that have not been accessed since the last pruning are + * removed now, while the lives of frequently-accessed entries are + * prolonged, up to three times the duration, according to their + * access counts. We don't try to shrink the buckets since pruning + * effectively caps catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1260,6 +1352,20 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* prolong life of this entry */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * Don't update the LRU too frequently. We need to maintain the LRU even + * if pruning is inactive since it can be turned on mid-session. + */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller.
@@ -1884,19 +1990,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try removing infrequently used + * entries to make room for the new entry. If that fails, enlarge the + * bucket array instead. Quite arbitrarily, we try this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 156d147c85..3acc86cd07 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -81,6 +81,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2205,6 +2206,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that remain unused for longer than this number of seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index bd6ea65d0c..e9e3acc903 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..a21c53644a 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of accesses to this entry, capped at 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +194,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 5c6357cc575bf0f1d03740c2f2e94d3d79a53f4e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 14:16:55 +0900 Subject: [PATCH 3/6] Asynchronous update of catcache clock The catcache pruning feature fails to work while a long running query executes many commands and fetches many syscache entries. This patch asynchronously updates the catcache clock to make the pruning work even in that case. --- src/backend/tcop/postgres.c | 11 +++++++ src/backend/utils/cache/catcache.c | 65 ++++++++++++++++++++++++++++++++++++-- src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 14 ++++++++ src/backend/utils/misc/guc.c | 2 +- src/include/miscadmin.h | 1 + src/include/utils/catcache.h | 23 +++++++++++++- src/include/utils/timeout.h | 1 + 8 files changed, 114 insertions(+), 4 deletions(-) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 02b9ef98aa..d9a54ed37f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3161,6 +3161,14 @@ ProcessInterrupts(void) if (ParallelMessagePending) HandleParallelMessages(); + + if (CatcacheClockTimeoutPending) + { + CatcacheClockTimeoutPending = false; + + /* Update timestamp then set up the next timeout */ + UpdateCatCacheClock(); + } } @@ -4023,6 +4031,9 @@ PostgresMain(int argc, char *argv[], QueryCancelPending = false; /* second to avoid race condition */ stmt_timeout_active = false; + /* get in sync with the timer state */ + 
catcache_clock_timeout_active = false; + /* Not reading from the client anymore. */ DoingCommandRead = false; diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 4386957497..e0ecfe09d4 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -69,7 +69,12 @@ int catalog_cache_prune_min_age = 300; /* - * Minimum interval between two successive moves of a cache entry in LRU list, + * Flag to keep track of whether catcache clock timer is active. + */ +bool catcache_clock_timeout_active = false; + +/* + * Minimum interval between two successive moves of a cache entry in LRU list, * in microseconds. */ #define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ @@ -871,6 +876,61 @@ InitCatCache(int id, return cp; } +/* + * Helper routine for SetCatCacheClock and UpdateCatCacheClock. + * + * Maintains the catcache clock during a long query. + */ +void +SetupCatCacheClockTimer(void) +{ + long delay; + + /* stop timer if no longer needed */ + if (catalog_cache_prune_min_age <= 0) + { + catcache_clock_timeout_active = false; + return; + } + + /* One tenth of the prune age, in milliseconds */ + delay = catalog_cache_prune_min_age * 1000/10; + + /* We don't need to update the clock so frequently. */ + if (delay < 1000) + delay = 1000; + + enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, delay); + + catcache_clock_timeout_active = true; +} + +/* + * Update catcacheclock: + * + * Intended to be called when CATCACHE_CLOCK_TIMEOUT fires. The interval is + * expected to be more than 1 second (see above), so calling + * GetCurrentTimestamp() does no harm. + */ +void +UpdateCatCacheClock(void) +{ + catcacheclock = GetCurrentTimestamp(); + SetupCatCacheClockTimer(); } + +/* + * A change of catalog_cache_prune_min_age requires re-arming the timer. Just + * disabling it here causes re-arming later as needed.
+ */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (catcache_clock_timeout_active) + disable_timeout(CATCACHE_CLOCK_TIMEOUT, false); + + catcache_clock_timeout_active = false; +} + /* * CatCacheCleanupOldEntries - Remove infrequently-used entries * @@ -905,7 +965,8 @@ CatCacheCleanupOldEntries(CatCache *cp) /* * Calculate the duration from the time from the last access to * the "current" time. catcacheclock is updated per-statement - basis. + basis and additionally updated periodically during a long + running query. */ TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index fd51934aaf..0e8b972a29 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false; volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; +volatile sig_atomic_t CatcacheClockTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a5ee209f91..eb17103595 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -72,6 +72,7 @@ static void ShutdownPostgres(int code, Datum arg); static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); +static void CatcacheClockTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -628,6 +629,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, RegisterTimeout(LOCK_TIMEOUT, 
LockTimeoutHandler); RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT, IdleInTransactionSessionTimeoutHandler); + RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, + CatcacheClockTimeoutHandler); } /* @@ -1238,6 +1241,17 @@ IdleInTransactionSessionTimeoutHandler(void) SetLatch(MyLatch); } +/* + * CATCACHE_CLOCK_TIMEOUT handler: trigger a catcache source clock update + */ +static void +CatcacheClockTimeoutHandler(void) +{ + CatcacheClockTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 3acc86cd07..0bdea0c383 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2214,7 +2214,7 @@ static struct config_int ConfigureNamesInt[] = }, &catalog_cache_prune_min_age, 300, -1, INT_MAX, - NULL, NULL, NULL + NULL, assign_catalog_cache_prune_min_age, NULL }, /* diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index c9e35003a5..33b800e80f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending; extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index a21c53644a..5141f57bac 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -200,13 +200,34 @@ extern int catalog_cache_prune_min_age; /* source clock for access timestamp of catcache entries */ extern TimestampTz catcacheclock; -/* SetCatCacheClock - set catcache timestamp source clock */ +/* + * 
Flag to keep track of whether catcache timestamp timer is active. + */ +extern bool catcache_clock_timeout_active; + +/* catcache prune time helper functions */ +extern void SetupCatCacheClockTimer(void); +extern void UpdateCatCacheClock(void); + +/* + * SetCatCacheClock - set catcache timestamp source clock + * + * The clock is passively updated on a per-query basis. We need to update it + * asynchronously when a long-running query executes many commands. Set up the + * timeout to do that. Setting up a timeout is complex enough that we don't + * want to do it at every query start, so it keeps running until + * catalog_cache_prune_min_age is changed. See UpdateCatCacheClock(). + */ static inline void SetCatCacheClock(TimestampTz ts) { catcacheclock = ts; + + if (!catcache_clock_timeout_active && catalog_cache_prune_min_age > 0) + SetupCatCacheClockTimer(); } +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index 9244a2a7b7..b2d97b4f7b 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -31,6 +31,7 @@ typedef enum TimeoutId STANDBY_TIMEOUT, STANDBY_LOCK_TIMEOUT, IDLE_IN_TRANSACTION_SESSION_TIMEOUT, + CATCACHE_CLOCK_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ -- 2.16.3 From 89f64ee52ea4656b8397524d511abbdf793521b9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 12:00:26 +0900 Subject: [PATCH 4/6] Syscache usage tracking feature Collects syscache usage statistics and shows them using the view pg_stat_syscache. The feature is controlled by the GUC variable track_syscache_usage_interval.
--- doc/src/sgml/config.sgml | 16 ++ src/backend/catalog/system_views.sql | 17 +++ src/backend/postmaster/pgstat.c | 201 ++++++++++++++++++++++++-- src/backend/tcop/postgres.c | 23 +++ src/backend/utils/adt/pgstatfuncs.c | 133 +++++++++++++++++ src/backend/utils/cache/catcache.c | 145 +++++++++++++++---- src/backend/utils/cache/syscache.c | 24 +++ src/backend/utils/init/globals.c | 1 + src/backend/utils/init/postinit.c | 11 ++ src/backend/utils/misc/guc.c | 10 ++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/catalog/pg_proc.dat | 9 ++ src/include/miscadmin.h | 1 + src/include/pgstat.h | 4 + src/include/utils/catcache.h | 13 +- src/include/utils/syscache.h | 19 +++ src/include/utils/timeout.h | 1 + src/test/regress/expected/rules.out | 24 ++- 18 files changed, 612 insertions(+), 41 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 737a156bb4..850fe4ea90 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -6689,6 +6689,22 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; </listitem> </varlistentry> + <varlistentry id="guc-track-catalog-cache-usage-interval" xreflabel="track_catalog_cache_usage_interval"> + <term><varname>track_catalog_cache_usage_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>track_catalog_cache_usage_interval</varname> + configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the interval, in milliseconds, at which catalog cache usage + statistics are collected for the session. This parameter is 0 by + default, which means the feature is disabled. Only superusers can + change this setting.
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-track-io-timing" xreflabel="track_io_timing"> <term><varname>track_io_timing</varname> (<type>boolean</type>) <indexterm> diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 3e229c693c..f5d1aaf96f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -906,6 +906,22 @@ CREATE VIEW pg_stat_progress_vacuum AS FROM pg_stat_get_progress_info('VACUUM') AS S LEFT JOIN pg_database D ON S.datid = D.oid; +CREATE VIEW pg_stat_syscache AS + SELECT + S.pid AS pid, + S.relid::regclass AS relname, + S.indid::regclass AS cache_name, + S.size AS size, + S.ntup AS ntuples, + S.searches AS searches, + S.hits AS hits, + S.neg_hits AS neg_hits, + S.ageclass AS ageclass, + S.last_update AS last_update + FROM pg_stat_activity A + JOIN LATERAL (SELECT A.pid, * FROM pg_get_syscache_stats(A.pid)) S + ON (A.pid = S.pid); + CREATE VIEW pg_user_mappings AS SELECT U.oid AS umid, @@ -1185,6 +1201,7 @@ GRANT EXECUTE ON FUNCTION pg_ls_waldir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_archive_statusdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir() TO pg_monitor; GRANT EXECUTE ON FUNCTION pg_ls_tmpdir(oid) TO pg_monitor; +GRANT EXECUTE ON FUNCTION pg_get_syscache_stats(int) TO pg_monitor; GRANT pg_read_all_settings TO pg_monitor; GRANT pg_read_all_stats TO pg_monitor; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 81c6499251..b15a3273ca 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -66,6 +66,7 @@ #include "utils/ps_status.h" #include "utils/rel.h" #include "utils/snapmgr.h" +#include "utils/syscache.h" #include "utils/timestamp.h" @@ -124,6 +125,7 @@ bool pgstat_track_activities = false; bool pgstat_track_counts = false; int pgstat_track_functions = TRACK_FUNC_OFF; +int pgstat_track_syscache_usage_interval = 0; int pgstat_track_activity_query_size = 1024; 
/* ---------- @@ -236,6 +238,11 @@ typedef struct TwoPhasePgStatRecord bool t_truncated; /* was the relation truncated? */ } TwoPhasePgStatRecord; +/* bitmap symbols to specify which types of stats files to remove */ +#define PGSTAT_REMFILE_DBSTAT 1 /* remove only database stats files */ +#define PGSTAT_REMFILE_SYSCACHE 2 /* remove only syscache stats files */ +#define PGSTAT_REMFILE_ALL 3 /* remove both types of files */ + /* * Info about current "snapshot" of stats file */ @@ -335,6 +342,7 @@ static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len); static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len); static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len); static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len); +static void pgstat_remove_syscache_statsfile(void); /* ------------------------------------------------------------ * Public functions called from postmaster follow @@ -630,10 +638,13 @@ startup_failed: } /* - * subroutine for pgstat_reset_all + * remove stats files + * + * Clean up stats files in the specified directory. target is one of + * PGSTAT_REMFILE_DBSTAT/SYSCACHE/ALL and restricts which files are removed. */ static void -pgstat_reset_remove_files(const char *directory) +pgstat_reset_remove_files(const char *directory, int target) { DIR *dir; struct dirent *entry; @@ -644,25 +655,39 @@ pgstat_reset_remove_files(const char *directory) { int nchars; Oid tmp_oid; + int filetype = 0; /* * Skip directory entries that don't match the file names we write. * See get_dbstat_filename for the database-specific pattern.
*/ if (strncmp(entry->d_name, "global.", 7) == 0) + { + filetype = PGSTAT_REMFILE_DBSTAT; nchars = 7; + } else { + char head[2]; + nchars = 0; - (void) sscanf(entry->d_name, "db_%u.%n", - &tmp_oid, &nchars); - if (nchars <= 0) - continue; + (void) sscanf(entry->d_name, "%c%c_%u.%n", + head, head + 1, &tmp_oid, &nchars); + /* %u allows leading whitespace, so reject that */ - if (strchr("0123456789", entry->d_name[3]) == NULL) + if (nchars < 3 || !isdigit(entry->d_name[3])) continue; + + if (strncmp(head, "db", 2) == 0) + filetype = PGSTAT_REMFILE_DBSTAT; + else if (strncmp(head, "cc", 2) == 0) + filetype = PGSTAT_REMFILE_SYSCACHE; } + /* skip if this is not a target */ + if ((filetype & target) == 0) + continue; + if (strcmp(entry->d_name + nchars, "tmp") != 0 && strcmp(entry->d_name + nchars, "stat") != 0) continue; @@ -683,8 +708,9 @@ pgstat_reset_remove_files(const char *directory) void pgstat_reset_all(void) { - pgstat_reset_remove_files(pgstat_stat_directory); - pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY); + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_ALL); + pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY, + PGSTAT_REMFILE_ALL); } #ifdef EXEC_BACKEND @@ -2963,6 +2989,10 @@ pgstat_beshutdown_hook(int code, Datum arg) if (OidIsValid(MyDatabaseId)) pgstat_report_stat(true); + /* clear syscache statistics files and temporary settings */ + if (MyBackendId != InvalidBackendId) + pgstat_remove_syscache_statsfile(); + /* * Clear my status entry, following the protocol of bumping st_changecount * before and after. 
We use a volatile pointer here to ensure the @@ -4287,6 +4317,9 @@ PgstatCollectorMain(int argc, char *argv[]) pgStatRunningInCollector = true; pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true); + /* Remove left-over syscache stats files */ + pgstat_reset_remove_files(pgstat_stat_directory, PGSTAT_REMFILE_SYSCACHE); + /* * Loop to process messages until we get SIGQUIT or detect ungraceful * death of our parent postmaster. @@ -6377,3 +6410,153 @@ pgstat_clip_activity(const char *raw_activity) return activity; } + +/* + * return the filename for a syscache stat file; filename is the output + * buffer, of length len. + */ +void +pgstat_get_syscachestat_filename(bool permanent, bool tempname, int backendid, + char *filename, int len) +{ + int printed; + + /* NB -- pgstat_reset_remove_files knows about the pattern this uses */ + printed = snprintf(filename, len, "%s/cc_%u.%s", + permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : + pgstat_stat_directory, + backendid, + tempname ? "tmp" : "stat"); + if (printed >= len) + elog(ERROR, "overlength pgstat path"); +} + +/* removes syscache stats files of this backend */ +static void +pgstat_remove_syscache_statsfile(void) +{ + char fname[MAXPGPATH]; + + pgstat_get_syscachestat_filename(false, false, MyBackendId, + fname, MAXPGPATH); + unlink(fname); /* we don't care about the result */ +} + +/* + * pgstat_write_syscache_stats() - + * Write the syscache statistics files. + * + * If 'force' is false, this function skips writing a file and returns the + * time remaining in the current interval in milliseconds. If 'force' is true, + * it writes a file regardless of the remaining time and resets the interval.
+ */ +long +pgstat_write_syscache_stats(bool force) +{ + static TimestampTz last_report = 0; + TimestampTz now; + long elapsed; + long secs; + int usecs; + int cacheId; + FILE *fpout; + char statfile[MAXPGPATH]; + char tmpfile[MAXPGPATH]; + + /* Return if we don't want it */ + if (!force && pgstat_track_syscache_usage_interval <= 0) + { + /* disabled; remove the statistics file if any */ + if (last_report > 0) + { + last_report = 0; + pgstat_remove_syscache_statsfile(); + } + return 0; + } + + /* Check against the interval */ + now = GetCurrentTransactionStopTimestamp(); + TimestampDifference(last_report, now, &secs, &usecs); + elapsed = secs * 1000 + usecs / 1000; + + if (!force && elapsed < pgstat_track_syscache_usage_interval) + { + /* not time yet; return the remaining time to the caller */ + return pgstat_track_syscache_usage_interval - elapsed; + } + + /* now update the stats */ + last_report = now; + + pgstat_get_syscachestat_filename(false, true, + MyBackendId, tmpfile, MAXPGPATH); + pgstat_get_syscachestat_filename(false, false, + MyBackendId, statfile, MAXPGPATH); + + /* + * This function can be called from ProcessInterrupts(). Hold off + * interrupts to avoid recursive entry. + */ + HOLD_INTERRUPTS(); + + fpout = AllocateFile(tmpfile, PG_BINARY_W); + if (fpout == NULL) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not open temporary statistics file \"%s\": %m", + tmpfile))); + /* + * Failure writing this file is not critical. Just skip this time and + * tell the caller to wait for the next interval.
+ */ + RESUME_INTERRUPTS(); + return pgstat_track_syscache_usage_interval; + } + + /* write out every catcache stats */ + for (cacheId = 0 ; cacheId < SysCacheSize ; cacheId++) + { + SysCacheStats *stats; + + stats = SysCacheGetStats(cacheId); + Assert (stats); + + /* write error is checked later using ferror() */ + fputc('T', fpout); + (void)fwrite(&cacheId, sizeof(int), 1, fpout); + (void)fwrite(&last_report, sizeof(TimestampTz), 1, fpout); + (void)fwrite(stats, sizeof(*stats), 1, fpout); + } + fputc('E', fpout); + + if (ferror(fpout)) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not write syscache statistics file \"%s\": %m", + tmpfile))); + FreeFile(fpout); + unlink(tmpfile); + } + else if (FreeFile(fpout) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not close syscache statistics file \"%s\": %m", + tmpfile))); + unlink(tmpfile); + } + else if (rename(tmpfile, statfile) < 0) + { + ereport(LOG, + (errcode_for_file_access(), + errmsg("could not rename syscache statistics file \"%s\" to \"%s\": %m", + tmpfile, statfile))); + unlink(tmpfile); + } + + RESUME_INTERRUPTS(); + return 0; +} diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index d9a54ed37f..39abb9fbab 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3159,6 +3159,12 @@ ProcessInterrupts(void) } + if (IdleSyscacheStatsUpdateTimeoutPending) + { + IdleSyscacheStatsUpdateTimeoutPending = false; + pgstat_write_syscache_stats(true); + } + if (ParallelMessagePending) HandleParallelMessages(); @@ -3743,6 +3749,7 @@ PostgresMain(int argc, char *argv[], sigjmp_buf local_sigjmp_buf; volatile bool send_ready_for_query = true; bool disable_idle_in_transaction_timeout = false; + bool disable_idle_syscache_update_timeout = false; /* Initialize startup process environment if necessary. 
*/ if (!IsUnderPostmaster) @@ -4186,9 +4193,19 @@ PostgresMain(int argc, char *argv[], } else { + long timeout; + ProcessCompletedNotifies(); pgstat_report_stat(false); + timeout = pgstat_write_syscache_stats(false); + + if (timeout > 0) + { + disable_idle_syscache_update_timeout = true; + enable_timeout_after(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + timeout); + } set_ps_display("idle", false); pgstat_report_activity(STATE_IDLE, NULL); } @@ -4231,6 +4248,12 @@ PostgresMain(int argc, char *argv[], disable_idle_in_transaction_timeout = false; } + if (disable_idle_syscache_update_timeout) + { + disable_timeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, false); + disable_idle_syscache_update_timeout = false; + } + /* * (6) check for any other interesting events that happened while we * slept. diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c index 69f7265779..26f923a66b 100644 --- a/src/backend/utils/adt/pgstatfuncs.c +++ b/src/backend/utils/adt/pgstatfuncs.c @@ -14,6 +14,8 @@ */ #include "postgres.h" +#include <sys/stat.h> + #include "access/htup_details.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" @@ -28,6 +30,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inet.h" +#include "utils/syscache.h" #include "utils/timestamp.h" #define UINT32_ACCESS_ONCE(var) ((uint32)(*((volatile uint32 *)&(var)))) @@ -1908,3 +1911,133 @@ pg_stat_get_archiver(PG_FUNCTION_ARGS) PG_RETURN_DATUM(HeapTupleGetDatum( heap_form_tuple(tupdesc, values, nulls))); } + +Datum +pgstat_get_syscache_stats(PG_FUNCTION_ARGS) +{ +#define PG_GET_SYSCACHE_SIZE 9 + int pid = PG_GETARG_INT32(0); + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + PgBackendStatus *beentry; + int beid; + char fname[MAXPGPATH]; + FILE *fpin; + char c; + + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + 
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + + oldcontext = MemoryContextSwitchTo(per_query_ctx); + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + /* find beentry for given pid*/ + beentry = NULL; + for (beid = 1; + (beentry = pgstat_fetch_stat_beentry(beid)) && + beentry->st_procpid != pid ; + beid++); + + /* + * we silently return empty result on failure or insufficient privileges + */ + if (!beentry || + (!has_privs_of_role(GetUserId(), beentry->st_userid) && + !is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))) + goto no_data; + + pgstat_get_syscachestat_filename(false, false, beid, fname, MAXPGPATH); + + if ((fpin = AllocateFile(fname, PG_BINARY_R)) == NULL) + { + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not open statistics file \"%s\": %m", + fname))); + /* also return empty on no statistics file */ + goto no_data; + } + + /* read the statistics file into tuplestore */ + while ((c = fgetc(fpin)) == 'T') + { + TimestampTz last_update; + SysCacheStats stats; + int cacheid; + Datum values[PG_GET_SYSCACHE_SIZE]; + bool nulls[PG_GET_SYSCACHE_SIZE] = {0}; + Datum datums[SYSCACHE_STATS_NAGECLASSES * 2]; + bool arrnulls[SYSCACHE_STATS_NAGECLASSES * 2] = {0}; + int dims[] = {SYSCACHE_STATS_NAGECLASSES, 2}; + int lbs[] = {1, 1}; + ArrayType *arr; + int i, j; + 
+ if (fread(&cacheid, sizeof(int), 1, fpin) != 1 || + fread(&last_update, sizeof(TimestampTz), 1, fpin) != 1 || + fread(&stats, 1, sizeof(stats), fpin) != sizeof(stats)) + { + ereport(WARNING, + (errmsg("corrupted syscache statistics file \"%s\"", + fname))); + goto no_data; + } + + i = 0; + values[i++] = ObjectIdGetDatum(stats.reloid); + values[i++] = ObjectIdGetDatum(stats.indoid); + values[i++] = Int64GetDatum(stats.size); + values[i++] = Int64GetDatum(stats.ntuples); + values[i++] = Int64GetDatum(stats.nsearches); + values[i++] = Int64GetDatum(stats.nhits); + values[i++] = Int64GetDatum(stats.nneg_hits); + + for (j = 0 ; j < SYSCACHE_STATS_NAGECLASSES ; j++) + { + datums[j * 2] = Int32GetDatum((int32) stats.ageclasses[j]); + datums[j * 2 + 1] = Int32GetDatum((int32) stats.nclass_entries[j]); + } + + arr = construct_md_array(datums, arrnulls, 2, dims, lbs, + INT4OID, sizeof(int32), true, 'i'); + values[i++] = PointerGetDatum(arr); + + values[i++] = TimestampTzGetDatum(last_update); + + Assert (i == PG_GET_SYSCACHE_SIZE); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* check for the end of file. abandon the result if file is broken */ + if (c != 'E' || fgetc(fpin) != EOF) + tuplestore_clear(tupstore); + + FreeFile(fpin); + +no_data: + tuplestore_donestoring(tupstore); + return (Datum) 0; +} diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index e0ecfe09d4..63c0ea3b17 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -85,6 +85,10 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock for the last accessed time of a catcache entry. 
*/ TimestampTz catcacheclock = 0; +/* age classes for pruning */ +static double ageclass[SYSCACHE_STATS_NAGECLASSES] + = {0.05, 0.1, 1.0, 2.0, 3.0, 0.0}; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -118,7 +122,7 @@ static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys); -static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, +static int CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); @@ -500,6 +504,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, ct->keys); + cache->cc_memusage -= ct->size; pfree(ct); --cache->cc_ntup; @@ -613,9 +618,7 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) else CatCacheRemoveCTup(cache, ct); CACHE_elog(DEBUG2, "CatCacheInvalidate: invalidated"); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif /* could be multiple matches, so keep looking! */ } } @@ -691,9 +694,7 @@ ResetCatalogCache(CatCache *cache) } else CatCacheRemoveCTup(cache, ct); -#ifdef CATCACHE_STATS cache->cc_invals++; -#endif } } } @@ -833,7 +834,12 @@ InitCatCache(int id, */ sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; cp = (CatCache *) CACHELINEALIGN(palloc0(sz)); - cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head)); + cp->cc_head_alloc_size = sz; + sz = nbuckets * sizeof(dlist_head); + cp->cc_bucket = palloc0(sz); + + /* cc_head_alloc_size + consumed size for cc_bucket */ + cp->cc_memusage = cp->cc_head_alloc_size + sz; /* * initialize the cache's relation information for the relation @@ -1011,13 +1017,17 @@ RehashCatCache(CatCache *cp) dlist_head *newbucket; int newnbuckets; int i; + size_t sz; elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); /* Allocate a new, larger, hash table. 
*/ newnbuckets = cp->cc_nbuckets * 2; - newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); + sz = newnbuckets * sizeof(dlist_head); + newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, sz); + + cp->cc_memusage = cp->cc_head_alloc_size + sz; /* Move all entries from old hash table to new. */ for (i = 0; i < cp->cc_nbuckets; i++) @@ -1031,6 +1041,7 @@ RehashCatCache(CatCache *cp) dlist_delete(iter.cur); dlist_push_head(&newbucket[hashIndex], &ct->cache_elem); + cp->cc_memusage += ct->size; } } @@ -1369,9 +1380,7 @@ SearchCatCacheInternal(CatCache *cache, if (unlikely(cache->cc_tupdesc == NULL)) CatalogCacheInitializeCache(cache); -#ifdef CATCACHE_STATS cache->cc_searches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1440,9 +1449,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_hits++; -#endif return &ct->tuple; } @@ -1451,9 +1458,7 @@ SearchCatCacheInternal(CatCache *cache, CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_neg_hits++; -#endif return NULL; } @@ -1581,9 +1586,7 @@ SearchCatCacheMiss(CatCache *cache, CACHE_elog(DEBUG2, "SearchCatCache(%s): put in bucket %d", cache->cc_relname, hashIndex); -#ifdef CATCACHE_STATS cache->cc_newloads++; -#endif return &ct->tuple; } @@ -1694,9 +1697,7 @@ SearchCatCacheList(CatCache *cache, Assert(nkeys > 0 && nkeys < cache->cc_nkeys); -#ifdef CATCACHE_STATS cache->cc_lsearches++; -#endif /* Initialize local parameter array */ arguments[0] = v1; @@ -1753,9 +1754,7 @@ SearchCatCacheList(CatCache *cache, CACHE_elog(DEBUG2, "SearchCatCacheList(%s): found list", cache->cc_relname); -#ifdef CATCACHE_STATS cache->cc_lhits++; -#endif return cl; } @@ -1862,6 +1861,11 @@ SearchCatCacheList(CatCache *cache, /* Now we can build the 
CatCList entry. */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); nmembers = list_length(ctlist); + + /* + * Don't waste a time by counting the list's memory usage, since it + * doesn't live a long life. + */ cl = (CatCList *) palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *)); @@ -1972,6 +1976,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, CatCTup *ct; HeapTuple dtp; MemoryContext oldcxt; + int tupsize; /* negative entries have no tuple associated */ if (ntp) @@ -1995,8 +2000,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, /* Allocate memory for CatCTup and the cached tuple in one go */ oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup) + - MAXIMUM_ALIGNOF + dtp->t_len); + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len; + ct = (CatCTup *) palloc(tupsize); ct->tuple.t_len = dtp->t_len; ct->tuple.t_self = dtp->t_self; ct->tuple.t_tableOid = dtp->t_tableOid; @@ -2029,14 +2034,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { Assert(negative); oldcxt = MemoryContextSwitchTo(CacheMemoryContext); - ct = (CatCTup *) palloc(sizeof(CatCTup)); + tupsize = sizeof(CatCTup); + ct = (CatCTup *) palloc(tupsize); /* * Store keys - they'll point into separately allocated memory if not * by-value. 
*/ - CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno, - arguments, ct->keys); + tupsize += + CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, + cache->cc_keyno, arguments, ct->keys); MemoryContextSwitchTo(oldcxt); } @@ -2060,7 +2067,10 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, cache->cc_ntup++; CacheHdr->ch_ntup++; - /* increase refcount so that the new entry survives pruning */ + ct->size = tupsize; + cache->cc_memusage += ct->size; + + /* increase refcount so that this survives pruning */ ct->refcount++; /* @@ -2103,13 +2113,14 @@ CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys) /* * Helper routine that copies the keys in the srckeys array into the dstkeys * one, guaranteeing that the datums are fully allocated in the current memory - * context. + * context. Returns allocated memory size. */ -static void +static int CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys) { int i; + int size = 0; /* * XXX: memory and lookup performance could possibly be improved by @@ -2138,8 +2149,25 @@ CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, dstkeys[i] = datumCopy(src, att->attbyval, att->attlen); + + /* calculate rough estimate memory usage by datumCopy */ + if (!att->attbyval) + { + if (att->attlen == -1) + { + struct varlena *vl = (struct varlena *) DatumGetPointer(src); + + if (VARATT_IS_EXTERNAL_EXPANDED(vl)) + size += EOH_get_flat_size(DatumGetEOHP(src)); + else + size += VARSIZE_ANY(vl); + } + else + size += datumGetSize(src, att->attbyval, att->attlen); + } } + return size; } /* @@ -2263,3 +2291,66 @@ PrintCatCacheListLeakWarning(CatCList *list) list->my_cache->cc_relname, list->my_cache->id, list, list->refcount); } + +/* + * CatCacheGetStats - fill in SysCacheStats struct. + * + * This is a support routine for SysCacheGetStats, substantially fills in the + * result. 
The classification here is based on the same criteria to + * CatCacheCleanupOldEntries(). + */ +void +CatCacheGetStats(CatCache *cache, SysCacheStats *stats) +{ + int i, j; + + Assert(ageclass[SYSCACHE_STATS_NAGECLASSES - 1] == 0.0); + + /* fill in the stats struct */ + stats->size = cache->cc_memusage; + stats->ntuples = cache->cc_ntup; + stats->nsearches = cache->cc_searches; + stats->nhits = cache->cc_hits; + stats->nneg_hits = cache->cc_neg_hits; + + /* + * catalog_cache_prune_min_age can be changed on-session, fill it every + * time + */ + for (i = 0 ; i < SYSCACHE_STATS_NAGECLASSES ; i++) + stats->ageclasses[i] = + (int) (catalog_cache_prune_min_age * ageclass[i]); + + /* + * nth element in nclass_entries stores the number of cache entries that + * have lived unaccessed for corresponding multiple in ageclass of + * catalog_cache_prune_min_age. + */ + memset(stats->nclass_entries, 0, sizeof(int) * SYSCACHE_STATS_NAGECLASSES); + + /* Scan the whole hash */ + for (i = 0; i < cache->cc_nbuckets; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cache->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. See CatCacheCleanupOldEntries for details. 
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + j = 0; + while (j < SYSCACHE_STATS_NAGECLASSES - 1 && + entry_age > stats->ageclasses[j]) + j++; + + stats->nclass_entries[j]++; + } + } +} diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index ac98c19155..7b38a06708 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -20,6 +20,9 @@ */ #include "postgres.h" +#include <sys/stat.h> +#include <unistd.h> + #include "access/htup_details.h" #include "access/sysattr.h" #include "catalog/indexing.h" @@ -1534,6 +1537,27 @@ RelationSupportsSysCache(Oid relid) return false; } +/* + * SysCacheGetStats - returns stats of specified syscache + * + * This routine returns the address of its local static memory. + */ +SysCacheStats * +SysCacheGetStats(int cacheId) +{ + static SysCacheStats stats; + + Assert(cacheId >=0 && cacheId < SysCacheSize); + + memset(&stats, 0, sizeof(stats)); + + stats.reloid = cacheinfo[cacheId].reloid; + stats.indoid = cacheinfo[cacheId].indoid; + + CatCacheGetStats(SysCache[cacheId], &stats); + + return &stats; +} /* * OID comparator for pg_qsort diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 0e8b972a29..b7c647b5e0 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; volatile sig_atomic_t CatcacheClockTimeoutPending = false; +volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending = false; volatile sig_atomic_t ConfigReloadPending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index eb17103595..f2f879b6d8 100644 --- a/src/backend/utils/init/postinit.c +++ 
b/src/backend/utils/init/postinit.c @@ -73,6 +73,7 @@ static void StatementTimeoutHandler(void); static void LockTimeoutHandler(void); static void IdleInTransactionSessionTimeoutHandler(void); static void CatcacheClockTimeoutHandler(void); +static void IdleSyscacheStatsUpdateTimeoutHandler(void); static bool ThereIsAtLeastOneRole(void); static void process_startup_options(Port *port, bool am_superuser); static void process_settings(Oid databaseid, Oid roleid); @@ -631,6 +632,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username, IdleInTransactionSessionTimeoutHandler); RegisterTimeout(CATCACHE_CLOCK_TIMEOUT, CatcacheClockTimeoutHandler); + RegisterTimeout(IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, + IdleSyscacheStatsUpdateTimeoutHandler); } /* @@ -1252,6 +1255,14 @@ CatcacheClockTimeoutHandler(void) SetLatch(MyLatch); } +static void +IdleSyscacheStatsUpdateTimeoutHandler(void) +{ + IdleSyscacheStatsUpdateTimeoutPending = true; + InterruptPending = true; + SetLatch(MyLatch); +} + /* * Returns true if at least one role is defined in this database cluster. */ diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 0bdea0c383..5c8c9146d1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -3158,6 +3158,16 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"track_catalog_cache_usage_interval", PGC_SUSET, STATS_COLLECTOR, + gettext_noop("Sets the interval between syscache usage collection, in milliseconds. 
Zero disables syscache usage tracking."), + NULL + }, + &pgstat_track_syscache_usage_interval, + 0, 0, INT_MAX / 2, + NULL, NULL, NULL + }, + { {"gin_pending_list_limit", PGC_USERSET, CLIENT_CONN_STATEMENT, gettext_noop("Sets the maximum size of the pending list for GIN index."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index e9e3acc903..4d39daced6 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -555,6 +555,7 @@ #track_io_timing = off #track_functions = none # none, pl, all #track_activity_query_size = 1024 # (change requires restart) +#track_catalog_cache_usage_interval = 0 # zero disables tracking #stats_temp_directory = 'pg_stat_tmp' diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index a4e173b484..1a67c4219f 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -9689,6 +9689,15 @@ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}', proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn}', prosrc => 'pg_get_replication_slots' }, +{ oid => '3425', + descr => 'syscache statistics', + proname => 'pg_get_syscache_stats', prorows => '100', proisstrict => 'f', + proretset => 't', provolatile => 'v', prorettype => 'record', + proargtypes => 'int4', + proallargtypes => '{int4,oid,oid,int8,int8,int8,int8,int8,_int4,timestamptz}', + proargmodes => '{i,o,o,o,o,o,o,o,o,o}', + proargnames => '{pid,relid,indid,size,ntup,searches,hits,neg_hits,ageclass,last_update}', + prosrc => 'pgstat_get_syscache_stats' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', proparallel => 'u', prorettype => 'record', proargtypes => 'name name bool', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 33b800e80f..767c94a63c 100644 --- 
a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t CatcacheClockTimeoutPending; +extern PGDLLIMPORT volatile sig_atomic_t IdleSyscacheStatsUpdateTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ConfigReloadPending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 88a75fb798..c90ee1a064 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -1144,6 +1144,7 @@ extern bool pgstat_track_activities; extern bool pgstat_track_counts; extern int pgstat_track_functions; extern PGDLLIMPORT int pgstat_track_activity_query_size; +extern int pgstat_track_syscache_usage_interval; extern char *pgstat_stat_directory; extern char *pgstat_stat_tmpname; extern char *pgstat_stat_filename; @@ -1228,6 +1229,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id); extern void pgstat_initstats(Relation rel); extern char *pgstat_clip_activity(const char *raw_activity); +extern void pgstat_get_syscachestat_filename(bool permanent, + bool tempname, int backendid, char *filename, int len); /* ---------- * pgstat_report_wait_start() - @@ -1363,5 +1366,6 @@ extern PgStat_StatFuncEntry *pgstat_fetch_stat_funcentry(Oid funcid); extern int pgstat_fetch_stat_numbackends(void); extern PgStat_ArchiverStats *pgstat_fetch_stat_archiver(void); extern PgStat_GlobalStats *pgstat_fetch_global(void); +extern long pgstat_write_syscache_stats(bool force); #endif /* PGSTAT_H */ diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 5141f57bac..310aeaeab5 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -63,12 +63,13 @@ typedef struct catcache ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* 
precomputed key info for heap * scans */ dlist_head cc_lru_list; + int cc_head_alloc_size;/* consumed memory to allocate this struct */ + int cc_memusage; /* memory usage of this catcache (excluding + * header part) */ /* - * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS - * doesn't break ABI for other modules + * Statistics entries */ -#ifdef CATCACHE_STATS long cc_searches; /* total # searches against this cache */ long cc_hits; /* # of matches against existing entry */ long cc_neg_hits; /* # of matches against negative entry */ @@ -81,7 +82,6 @@ typedef struct catcache long cc_invals; /* # of entries invalidated from cache */ long cc_lsearches; /* total # list-searches */ long cc_lhits; /* # of matches against existing lists */ -#endif } CatCache; @@ -124,6 +124,7 @@ typedef struct catctup int naccess; /* # of access to this entry, up to 2 */ TimestampTz lastaccess; /* timestamp of the last usage */ dlist_node lru_node; /* LRU node */ + int size; /* palloc'ed size off this tuple */ /* * The tuple may also be a member of at most one CatCList. 
(If a single @@ -267,4 +268,8 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* defined in syscache.h */ +typedef struct syscachestats SysCacheStats; +extern void CatCacheGetStats(CatCache *cache, SysCacheStats *syscachestats); + #endif /* CATCACHE_H */ diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h index 95ee48954e..71b399c902 100644 --- a/src/include/utils/syscache.h +++ b/src/include/utils/syscache.h @@ -112,6 +112,24 @@ enum SysCacheIdentifier #define SysCacheSize (USERMAPPINGUSERSERVER + 1) }; +#define SYSCACHE_STATS_NAGECLASSES 6 +/* Struct for catcache tracking information */ +typedef struct syscachestats +{ + Oid reloid; /* target relation */ + Oid indoid; /* index */ + size_t size; /* size of the catcache */ + int ntuples; /* number of tuples resides in the catcache */ + int nsearches; /* number of searches */ + int nhits; /* number of cache hits */ + int nneg_hits; /* number of negative cache hits */ + /* age classes in seconds */ + int ageclasses[SYSCACHE_STATS_NAGECLASSES]; + /* number of tuples fall into the corresponding age class */ + int nclass_entries[SYSCACHE_STATS_NAGECLASSES]; +} SysCacheStats; + + extern void InitCatalogCache(void); extern void InitCatalogCachePhase2(void); @@ -164,6 +182,7 @@ extern void SysCacheInvalidate(int cacheId, uint32 hashValue); extern bool RelationInvalidatesSnapshotsOnly(Oid relid); extern bool RelationHasSysCache(Oid relid); extern bool RelationSupportsSysCache(Oid relid); +extern SysCacheStats *SysCacheGetStats(int cacheId); /* * The use of the macros below rather than direct calls to the corresponding diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h index b2d97b4f7b..0677978923 100644 --- a/src/include/utils/timeout.h +++ b/src/include/utils/timeout.h @@ -32,6 +32,7 @@ typedef enum TimeoutId STANDBY_LOCK_TIMEOUT, 
IDLE_IN_TRANSACTION_SESSION_TIMEOUT, CATCACHE_CLOCK_TIMEOUT, + IDLE_SYSCACHE_STATS_UPDATE_TIMEOUT, /* First user-definable timeout reason */ USER_TIMEOUT, /* Maximum number of timeout reasons */ diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index 98f417cb57..cf404a3930 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1929,6 +1929,28 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid, pg_stat_all_tables.autoanalyze_count FROM pg_stat_all_tables WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname~ '^pg_toast'::text)); +pg_stat_syscache| SELECT s.pid, + (s.relid)::regclass AS relname, + (s.indid)::regclass AS cache_name, + s.size, + s.ntup AS ntuples, + s.searches, + s.hits, + s.neg_hits, + s.ageclass, + s.last_update + FROM (pg_stat_activity a + JOIN LATERAL ( SELECT a.pid, + pg_get_syscache_stats.relid, + pg_get_syscache_stats.indid, + pg_get_syscache_stats.size, + pg_get_syscache_stats.ntup, + pg_get_syscache_stats.searches, + pg_get_syscache_stats.hits, + pg_get_syscache_stats.neg_hits, + pg_get_syscache_stats.ageclass, + pg_get_syscache_stats.last_update + FROM pg_get_syscache_stats(a.pid) pg_get_syscache_stats(relid, indid, size, ntup, searches, hits, neg_hits, ageclass,last_update)) s ON ((a.pid = s.pid))); pg_stat_user_functions| SELECT p.oid AS funcid, n.nspname AS schemaname, p.proname AS funcname, @@ -2360,7 +2382,7 @@ pg_settings|pg_settings_n|CREATE RULE pg_settings_n AS ON UPDATE TO pg_catalog.pg_settings DO INSTEAD NOTHING; pg_settings|pg_settings_u|CREATE RULE pg_settings_u AS ON UPDATE TO pg_catalog.pg_settings - WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false) AS set_config; + WHERE (new.name = old.name) DO SELECT set_config(old.name, new.setting, false, false) AS set_config; rtest_emp|rtest_emp_del|CREATE RULE rtest_emp_del AS ON DELETE TO 
public.rtest_emp DO INSERT INTO rtest_emplog (ename, who, action, newsal, oldsal) VALUES (old.ename, CURRENT_USER, 'fired'::bpchar, '$0.00'::money, old.salary); -- 2.16.3
>From: Tsunakawa, Takayuki [mailto:tsunakawa.takay@jp.fujitsu.com]
>> [Size=800, iter=1,000,000]
>> Master |15.763
>> Patched|16.262 (+3%)
>>
>> [Size=32768, iter=1,000,000]
>> Master |61.3076
>> Patched|62.9566 (+2%)
>
>What's the unit, second or millisecond?

Millisecond.

>Why does the number of digits to the right of the decimal point differ?
>
>Is the measurement correct? I'm wondering because the difference is larger in the
>latter case. Isn't the accounting processing almost the same in both cases?
>* former: 16.262 - 15.763 = 0.499
>* latter: 62.956 - 61.307 = 1.649
>I think the overhead is sufficiently small. It may get even smaller with a trivial tweak.
>
>You added the new member usedspace at the end of MemoryContextData. The
>original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes.
>So, the new member will be on a separate cache line. Try putting usedspace before
>the name member.

OK. I changed the order of the MemoryContextData members so that usedspace fits into the first cache line.

I disabled the whole catcache eviction mechanism in the patched version and compared it with master to check whether the overhead of the memory accounting is small enough. The settings are almost the same as in the last email, but last time the number of trials was 50, so I increased it to 5000 runs and calculated the average (rounded off to three decimal places).

[Size=800, iter=1,000,000]
Master |15.64 ms
Patched|16.26 ms (+4%)

The difference is 0.62 ms.

[Size=32768, iter=1,000,000]
Master |61.39 ms
Patched|60.99 ms (-1%)

I guess there is around 2% noise, but based on this experiment the overhead seems small. There is still some overhead, but it is small compared with the cost of other operations such as malloc(). Does this result show that the hard-limit size option with memory accounting doesn't harm ordinary users who disable the hard-limit option?

Regards,
Takeshi Ideriha
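The cache-line point above can be made concrete with a standalone sketch. This is illustration only and assumes an LP64 platform with 64-byte cache lines: LayoutAppended and LayoutPacked are invented stand-ins, not the real MemoryContextData, but they show why appending a hot counter at the end of the struct can push it onto a second cache line, while moving it before the name pointer keeps it in the first.

```c
/*
 * Illustration only: invented stand-ins for MemoryContextData.  Member
 * order decides whether the frequently updated "usedspace" counter shares
 * the first 64-byte cache line with the other hot fields.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

typedef struct LayoutAppended
{
    void       *methods;
    void       *parent;
    void       *firstchild;
    void       *prevchild;
    void       *nextchild;
    char       *name;
    char        other[24];      /* stand-in for the remaining members */
    int64_t     usedspace;      /* appended at the end: offset 72 on LP64 */
} LayoutAppended;

typedef struct LayoutPacked
{
    void       *methods;
    void       *parent;
    void       *firstchild;
    void       *prevchild;
    void       *nextchild;
    int64_t     usedspace;      /* moved before name: offset 40 on LP64 */
    char       *name;
    char        other[24];
} LayoutPacked;
```

On a 64-bit build, offsetof(LayoutAppended, usedspace) is 72 (past the first line) while offsetof(LayoutPacked, usedspace) is 40, which is the effect the reordering suggestion relies on.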
Hello.

At Mon, 4 Mar 2019 03:03:51 +0000, "Ideriha, Takeshi" <ideriha.takeshi@jp.fujitsu.com> wrote in <4E72940DA2BF16479384A86D54D0988A6F44564E@G01JPEXMBKW04>
> Does this result show that the hard-limit size option with memory accounting
> doesn't harm ordinary users who disable the hard-limit option?

Not sure; 4% seems beyond the noise level. The planner mainly requests smaller allocation sizes, especially for list operations. If we implemented this for the slab allocator, the degradation would be more significant.

We *are* suffering from endless bloat of the system cache (and some other things) and there is no way to deal with it. The soft-limit feature actually eliminates the problem with no degradation, and even accelerates execution in some cases. Infinite bloat is itself a problem, but if the processes just need a larger yet finite amount of memory, adding memory or lowering max_connections is enough. What Andres and Robert suggested is that we need a more convincing reason for the hard-limit feature other than "someone wants it". I think the degradation from the crude accounting is not the primary issue here.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Mar 1, 2019 at 3:33 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > It is artificial (or actually won't be repeatedly executed in a
> > > session) but anyway what can get benefit from
> > > catalog_cache_memory_target would be a kind of extreme case.
> >
> > I agree. So then let's not have it.
>
> Ah... Yeah! I see. Andres' concern was that crucial syscache
> entries might be blown away during a long idle time. If that
> happens, it's enough to just turn the feature off in almost all
> such cases.

+1.

> In the attached v18,
> catalog_cache_memory_target is removed,
> some leftovers of removing the hard-limit feature are removed,
> the catcache clock update during a query is separated into 0003, and
> 0004 (the monitoring part) is attached just to see how it is working.
>
> v18-0001-Add-dlist_move_tail:
> Just adds dlist_move_tail
>
> v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
> Revised pruning feature.

OK, so this is getting simpler, but I'm wondering why we need dlist_move_tail() at all. It is a well-known fact that maintaining LRU ordering is expensive, and it seems to be unnecessary for our purposes here. Can't CatCacheCleanupOldEntries just use a single-bit flag on the entry? If the flag is set, clear it. If the flag is clear, drop the entry. When an entry is used, set the flag. Then entries will go away if they are not used between consecutive calls to CatCacheCleanupOldEntries. Sure, that might be slightly less accurate in terms of which entries get thrown away, but I bet it makes no real difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
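The flag scheme described above can be sketched as standalone C. This is a hedged illustration: ToyEntry, toy_touch, and toy_sweep are invented names, and the real change would operate on CatCTup entries inside CatCacheCleanupOldEntries rather than on a toy linked list.

```c
/*
 * Sketch of the single-bit "recently used" scheme: lookups set the flag,
 * and a periodic sweep clears set flags and evicts entries whose flag was
 * already clear (i.e. unused for a full interval).
 */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyEntry
{
    int         key;
    bool        recently_used;  /* set on lookup, cleared by the sweep */
    struct ToyEntry *next;
} ToyEntry;

/* Mark an entry as used (the cache-lookup path would do this). */
static void
toy_touch(ToyEntry *e)
{
    e->recently_used = true;
}

/*
 * One sweep pass: entries used since the previous pass survive with the
 * flag cleared; untouched entries are evicted.  Returns the number of
 * entries removed.
 */
static int
toy_sweep(ToyEntry **head)
{
    int         removed = 0;
    ToyEntry  **link = head;

    while (*link != NULL)
    {
        ToyEntry   *e = *link;

        if (e->recently_used)
        {
            e->recently_used = false;   /* gets one more interval */
            link = &e->next;
        }
        else
        {
            *link = e->next;            /* unused for a full interval */
            free(e);
            removed++;
        }
    }
    return removed;
}
```

Note the trade-off Robert mentions: an entry touched just before a sweep and never again still survives until the sweep after next, so eviction accuracy is bounded by the sweep interval rather than exact last-access order.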
Robert Haas <robertmhaas@gmail.com> writes: > OK, so this is getting simpler, but I'm wondering why we need > dlist_move_tail() at all. It is a well-known fact that maintaining > LRU ordering is expensive and it seems to be unnecessary for our > purposes here. Yeah ... LRU maintenance was another thing that used to be in the catcache logic and was thrown out as far too expensive. Your idea of just using a clock sweep instead seems plausible. regards, tom lane
On 3/6/19 9:17 PM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> OK, so this is getting simpler, but I'm wondering why we need >> dlist_move_tail() at all. It is a well-known fact that maintaining >> LRU ordering is expensive and it seems to be unnecessary for our >> purposes here. > > Yeah ... LRU maintenance was another thing that used to be in the > catcache logic and was thrown out as far too expensive. Your idea > of just using a clock sweep instead seems plausible. > I agree clock sweep might be sufficient, although the benchmarks done in this thread so far do not suggest the LRU approach is very expensive. A simple true/false flag, as proposed by Robert, would mean we can only do the cleanup once per the catalog_cache_prune_min_age interval, so with the default value (5 minutes) the entries might be between 5 and 10 minutes old. That's probably acceptable, although for higher values the range gets wider and wider ... Which part of the LRU approach is supposedly expensive? Updating the lastaccess field or moving the entries to tail? I'd guess it's the latter, so perhaps we can keep some sort of age field, update it less frequently (once per minute?), and do the clock sweep? BTW wasn't one of the cases this thread aimed to improve a session that accesses a lot of objects in a short period of time? That balloons the syscache, and while this patch evicts the entries from memory, we never actually release the memory back (because AllocSet just moves it into the freelists) and it's unlikely to get swapped out (because other chunks on those memory pages are likely to be still used). I've proposed to address that by recreating the context if it gets too bloated, and I think Alvaro agreed with that. But I haven't seen any further discussion about that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I agree clock sweep might be sufficient, although the benchmarks done in > this thread so far do not suggest the LRU approach is very expensive. I'm not sure how thoroughly it's been tested -- has someone constructed a benchmark that does a lot of syscache lookups and measured how much slower they get with this new code? > A simple true/false flag, as proposed by Robert, would mean we can only > do the cleanup once per the catalog_cache_prune_min_age interval, so > with the default value (5 minutes) the entries might be between 5 and 10 > minutes old. That's probably acceptable, although for higher values the > range gets wider and wider ... That's true, but I don't know that it matters. I'm not sure there's much of a use case for raising this parameter to some larger value, but even if there is, is it really worth complicating the mechanism to make sure that we throw away entries in a more timely fashion? That's not going to be cost-free, either in terms of CPU cycles or in terms of code complexity. Again, I think our goal should be to add the least mechanism here that solves the problem. If we can show that a true/false flag makes poor decisions about which entries to evict and a smarter algorithm does better, then it's worth considering. However, my bet is that it makes no meaningful difference. > Which part of the LRU approach is supposedly expensive? Updating the > lastaccess field or moving the entries to tail? I'd guess it's the > latter, so perhaps we can keep some sort of age field, update it less > frequently (once per minute?), and do the clock sweep? Move to tail (although lastaccess would be expensive too, if it involves an extra gettimeofday() call). GCLOCK, like we use for shared_buffers, is a common approximation of LRU which tends to be a lot less expensive to implement. 
We could do that here and it might work well, but I think the question, again, is whether we really need it. I think our goal here should just be to jettison cache entries that are clearly worthless. It's expensive enough to reload cache entries that any kind of aggressive eviction policy is probably a loser, and if our goal is just to get rid of the stuff that's clearly not being used, we don't need to be super-accurate about it. > BTW wasn't one of the cases this thread aimed to improve a session that > accesses a lot of objects in a short period of time? That balloons the > syscache, and while this patch evicts the entries from memory, we never > actually release the memory back (because AllocSet just moves it into > the freelists) and it's unlikely to get swapped out (because other > chunks on those memory pages are likely to be still used). I've proposed > to address that by recreating the context if it gets too bloated, and I > think Alvaro agreed with that. But I haven't seen any further discussion > about that. That's an interesting point. It seems reasonable to me to just throw away everything and release all memory if the session has been idle for a while, but if the session is busy doing stuff, discarding everything in bulk like that is going to cause latency spikes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
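[Editor's note] GCLOCK, as used for shared_buffers, replaces LRU links with a small per-entry usage count: a hit bumps the count up to a cap, and the sweep decrements counts and evicts entries that have already reached zero. A toy sketch under those assumptions (simplified types, not the actual buffer-manager or catcache code; the cap of 2 mirrors the naccess limit in the patch posted later in this thread):

```c
#include <assert.h>

#define MAX_USAGE 2  /* cap on the usage count, analogous to naccess */

/* Hypothetical, simplified entry -- not the real BufferDesc/CatCTup. */
typedef struct GClockEntry
{
    int usage;  /* bumped on access, capped at MAX_USAGE */
    int alive;  /* still resident? */
} GClockEntry;

/* Called on every cache hit. */
void
gclock_touch(GClockEntry *e)
{
    if (e->usage < MAX_USAGE)
        e->usage++;
}

/*
 * One pass of the sweep hand: decrement nonzero usage counts, evict
 * entries already at zero.  A frequently-hit entry therefore survives
 * up to MAX_USAGE + 1 idle passes before it is dropped, giving a
 * cheap approximation of LRU with no list maintenance on access.
 */
int
gclock_sweep(GClockEntry *entries, int n)
{
    int nremoved = 0;

    for (int i = 0; i < n; i++)
    {
        if (!entries[i].alive)
            continue;
        if (entries[i].usage > 0)
            entries[i].usage--;  /* another chance */
        else
        {
            entries[i].alive = 0;
            nremoved++;
        }
    }
    return nremoved;
}
```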
On 3/7/19 3:34 PM, Robert Haas wrote: > On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> I agree clock sweep might be sufficient, although the benchmarks done in >> this thread so far do not suggest the LRU approach is very expensive. > > I'm not sure how thoroughly it's been tested -- has someone > constructed a benchmark that does a lot of syscache lookups and > measured how much slower they get with this new code? > What I've done on v13 (and I don't think the results would be that different on the current patch, but I may rerun it if needed) is a test that creates large number of tables (up to 1M) and then accesses them randomly. I don't know if it matches what you imagine, but see [1] https://www.postgresql.org/message-id/74386116-0bc5-84f2-e614-0cff19aca2de%402ndquadrant.com I don't think this shows any regression, but perhaps we should do a microbenchmark isolating the syscache entirely? >> A simple true/false flag, as proposed by Robert, would mean we can only >> do the cleanup once per the catalog_cache_prune_min_age interval, so >> with the default value (5 minutes) the entries might be between 5 and 10 >> minutes old. That's probably acceptable, although for higher values the >> range gets wider and wider ... > > That's true, but I don't know that it matters. I'm not sure there's > much of a use case for raising this parameter to some larger value, > but even if there is, is it really worth complicating the mechanism to > make sure that we throw away entries in a more timely fashion? That's > not going to be cost-free, either in terms of CPU cycles or in terms > of code complexity. > True, although it very much depends on how expensive it would be. > Again, I think our goal should be to add the least mechanism here that > solves the problem. If we can show that a true/false flag makes poor > decisions about which entries to evict and a smarter algorithm does > better, then it's worth considering. 
However, my bet is that it makes > no meaningful difference. > True. >> Which part of the LRU approach is supposedly expensive? Updating the >> lastaccess field or moving the entries to tail? I'd guess it's the >> latter, so perhaps we can keep some sort of age field, update it less >> frequently (once per minute?), and do the clock sweep? > > Move to tail (although lastaccess would be expensive too, if it > involves an extra gettimeofday() call). GCLOCK, like we use for > shared_buffers, is a common approximation of LRU which tends to be a > lot less expensive to implement. We could do that here and it might > work well, but I think the question, again, is whether we really need > it. I think our goal here should just be to jettison cache entries > that are clearly worthless. It's expensive enough to reload cache > entries that any kind of aggressive eviction policy is probably a > loser, and if our goal is just to get rid of the stuff that's clearly > not being used, we don't need to be super-accurate about it. > True. >> BTW wasn't one of the cases this thread aimed to improve a session that >> accesses a lot of objects in a short period of time? That balloons the >> syscache, and while this patch evicts the entries from memory, we never >> actually release the memory back (because AllocSet just moves it into >> the freelists) and it's unlikely to get swapped out (because other >> chunks on those memory pages are likely to be still used). I've proposed >> to address that by recreating the context if it gets too bloated, and I >> think Alvaro agreed with that. But I haven't seen any further discussion >> about that. > > That's an interesting point. It seems reasonable to me to just throw > away everything and release all memory if the session has been idle > for a while, but if the session is busy doing stuff, discarding > everything in bulk like that is going to cause latency spikes. 
> What I had in mind is more along these lines: (a) track number of active syscache entries (increment when adding a new one, decrement when evicting one) (b) track peak number of active syscache entries (c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a memory context swap, i.e. create a new context, copy active entries over and destroy the old one That would at least free() the memory. Of course, the syscache entries may have different sizes, so tracking just numbers of entries is just an approximation. But I think it'd be enough. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
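[Editor's note] The bookkeeping in Tomas's steps (a)–(c) amounts to two counters and a ratio test. A hypothetical sketch (the names are invented for illustration; in PostgreSQL the "swap" step would rebuild the cache's memory context and copy live entries over, which is not shown here):

```c
#include <assert.h>

/*
 * Track live and peak syscache entry counts to decide when the cache
 * memory context is mostly dead freelist space and worth rebuilding.
 */
typedef struct CacheStats
{
    long active;  /* (a) currently live entries */
    long peak;    /* (b) high-water mark of live entries */
} CacheStats;

/* Called when a new entry is added to the cache. */
void
stats_entry_added(CacheStats *s)
{
    s->active++;
    if (s->active > s->peak)
        s->peak = s->active;
}

/* Called when an entry is evicted. */
void
stats_entry_evicted(CacheStats *s)
{
    s->active--;
}

/*
 * (c) after a clock sweep: peak > K*active means most of the context
 * once held entries that are now gone, so a copy-and-compact context
 * swap would actually return memory via free().
 */
int
needs_context_swap(const CacheStats *s, int K)
{
    return s->peak > (long) K * s->active;
}
```

As Tomas notes, counting entries rather than bytes is only an approximation, since syscache entries vary in size; but it is cheap and probably accurate enough to trigger the swap at the right times.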
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Mar 6, 2019 at 6:18 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> Which part of the LRU approach is supposedly expensive? Updating the >> lastaccess field or moving the entries to tail? I'd guess it's the >> latter, so perhaps we can keep some sort of age field, update it less >> frequently (once per minute?), and do the clock sweep? > Move to tail (although lastaccess would be expensive if too if it > involves an extra gettimeofday() call). As I recall, the big problem with the old LRU code was loss of locality of access, in that in addition to the data associated with hot syscache entries, you were necessarily also touching list link fields associated with not-hot entries. That's bad for the CPU cache. A gettimeofday call (or any other kernel call) per syscache access would be a complete disaster. regards, tom lane
On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I don't think this shows any regression, but perhaps we should do a > microbenchmark isolating the syscache entirely? Well, if we need the LRU list, then yeah I think a microbenchmark would be a good idea to make sure we really understand what the impact of that is going to be. But if we don't need it and can just remove it then we don't. > What I had in mind is more along these lines: > > (a) track number of active syscache entries (increment when adding a new > one, decrement when evicting one) > > (b) track peak number of active syscache entries > > (c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a > memory context swap, i.e. create a new context, copy active entries over > and destroy the old one > > That would at least free() the memory. Of course, the syscache entries > may have different sizes, so tracking just numbers of entries is just an > approximation. But I think it'd be enough. Yeah, that could be done. I'm not sure how expensive it would be, and I'm also not sure how much more effective it would be than what's currently proposed in terms of actually freeing memory. If you free enough dead syscache entries, you might manage to give some memory back to the OS: after all, there may be some locality there. And even if you don't, you'll at least prevent further growth, which might be good enough. We could consider doing some version of what has been proposed here and the thing you're proposing here could later be implemented on top of that. I mean, evicting entries at all is a prerequisite to copy-and-compact. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/7/19 4:01 PM, Robert Haas wrote: > On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> I don't think this shows any regression, but perhaps we should do a >> microbenchmark isolating the syscache entirely? > > Well, if we need the LRU list, then yeah I think a microbenchmark > would be a good idea to make sure we really understand what the impact > of that is going to be. But if we don't need it and can just remove > it then we don't. > >> What I had in mind is more along these lines: >> >> (a) track number of active syscache entries (increment when adding a new >> one, decrement when evicting one) >> >> (b) track peak number of active syscache entries >> >> (c) after clock-sweep, if (peak > K*active) where K=2 or K=4 or so, do a >> memory context swap, i.e. create a new context, copy active entries over >> and destroy the old one >> >> That would at least free() the memory. Of course, the syscache entries >> may have different sizes, so tracking just numbers of entries is just an >> approximation. But I think it'd be enough. > > Yeah, that could be done. I'm not sure how expensive it would be, and > I'm also not sure how much more effective it would be than what's > currently proposed in terms of actually freeing memory. If you free > enough dead syscache entries, you might manage to give some memory > back to the OS: after all, there may be some locality there. And even > if you don't, you'll at least prevent further growth, which might be > good enough. > I have my doubts about that happening in practice. It might happen for some workloads, but I think the locality is rather unpredictable. > We could consider doing some version of what has been proposed here > and the thing you're proposing here could later be implemented on top > of that. I mean, evicting entries at all is a prerequisite to > copy-and-compact. > Sure. I'm not saying the patch must do this to make it committable. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From: Robert Haas [mailto:robertmhaas@gmail.com] >On Thu, Mar 7, 2019 at 9:49 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> >wrote: >> I don't think this shows any regression, but perhaps we should do a >> microbenchmark isolating the syscache entirely? > >Well, if we need the LRU list, then yeah I think a microbenchmark would be a good idea >to make sure we really understand what the impact of that is going to be. But if we >don't need it and can just remove it then we don't. Just to be sure: we introduced the LRU list in this thread to find the entries older than the threshold time without scanning the whole hash table. If the hash table becomes large and there is no LRU list, that scan becomes slow. Regards, Takeshi Ideriha
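[Editor's note] Ideriha's point is the key property of the LRU list: because entries are kept oldest-first, the prune loop can stop at the first entry younger than the threshold, whereas the flag/clock alternatives must visit every hash bucket. A simplified sketch with plain arrays standing in for the patch's cc_lru_list (hypothetical names):

```c
#include <assert.h>

/*
 * Prune an LRU-ordered cache: lastaccess[] is sorted ascending (oldest
 * entry first), mirroring a list maintained by dlist_move_tail on each
 * access.  Entries idle for at least min_age are evicted; the loop can
 * break at the first young entry, since everything behind it in the
 * list is younger still.  Returns the number of entries evicted.
 */
int
prune_lru(const long lastaccess[], int alive[], int n,
          long now, long min_age)
{
    int nremoved = 0;

    for (int i = 0; i < n; i++)
    {
        if (now - lastaccess[i] < min_age)
            break;              /* all remaining entries are younger */
        if (alive[i])
        {
            alive[i] = 0;
            nremoved++;
        }
    }
    return nremoved;
}
```

This is the trade-off Robert summarizes next: the early exit makes each prune cheap, but paying for it requires list maintenance on every lookup.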
>From: Vladimir Sitnikov [mailto:sitnikov.vladimir@gmail.com] > >Robert> This email thread is really short on clear demonstrations that X >Robert> or Y is useful. > >It is useful when the whole database does **not** crash, isn't it? > >Case A (==current PostgeSQL mode): syscache grows, then OOMkiller chimes in, kills >the database process, and it leads to the complete cluster failure (all other PG >processes terminate themselves). > >Case B (==limit syscache by 10MiB or whatever as Tsunakawa, Takayuki >asks): a single ill-behaved process works a bit slower and/or consumers more CPU >than the other ones. The whole DB is still alive. > >I'm quite sure "case B" is much better for the end users and for the database >administrators. > >So, +1 to Tsunakawa, Takayuki, it would be so great if there was a way to limit the >memory consumption of a single process (e.g. syscache, workmem, etc, etc). > >Robert> However, memory usage is quite unpredictable. It depends on how >Robert> many backends are active > >The number of backends can be limited by ensuring a proper limits at application >connection pool level and/or pgbouncer and/or things like that. > >Robert>how many copies of work_mem and/or maintenance_work_mem are in >Robert>use > >There might be other patches to cap the total use of >work_mem/maintenance_work_mem, > >Robert>I don't think we >Robert> can say that just imposing a limit on the size of the system >Robert>caches is going to be enough to reliably prevent an out of >Robert>memory condition > >The less possibilities there are for OOM the better. Quite often it is much better to fail >a single SQL rather than kill all the DB processes. Yeah, I agree. This limit would be useful for such extreme situation. Regards, Takeshi Ideriha
On Thu, Mar 7, 2019 at 11:40 PM Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com> wrote: > Just to be sure, we introduced the LRU list in this thread to find the entries less than threshold time > without scanning the whole hash table. If hash table becomes large without LRU list, scanning time becomes slow. Hmm. So, it's a trade-off, right? One option is to have an LRU list, which imposes a small overhead on every syscache or catcache operation to maintain the LRU ordering. The other option is to have no LRU list, which imposes a larger overhead every time we clean up the syscaches. My bias is toward thinking that the latter is better, because: 1. Not everybody is going to use this feature, and 2. Syscache cleanup should be something that only happens every so many minutes, and probably while the backend is otherwise idle, whereas lookups can happen many times per millisecond. However, perhaps someone will provide some evidence that casts a different light on the situation. I don't see much point in continuing to review this patch at this point. There's been no new version of the patch in 3 weeks, and there is -- in my view at least -- a rather frustrating lack of evidence that the complexity this patch introduces is actually beneficial. No matter how many people +1 the idea of making this more complicated, it can't be justified unless you can provide a test result showing that the additional complexity solves a problem that does not get solved without that complexity. And even then, who is going to commit a patch that uses a design which Tom Lane says was tried before and stunk? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello. Sorry for being a bit late. At Wed, 27 Mar 2019 17:30:37 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190327.173037.40342566.horiguchi.kyotaro@lab.ntt.co.jp> > > I don't see much point in continuing to review this patch at this > > point. There's been no new version of the patch in 3 weeks, and there > > is -- in my view at least -- a rather frustrating lack of evidence > > that the complexity this patch introduces is actually beneficial. No > > matter how many people +1 the idea of making this more complicated, it > > can't be justified unless you can provide a test result showing that > > the additional complexity solves a problem that does not get solved > > without that complexity. And even then, who is going to commit a > > patch that uses a design which Tom Lane says was tried before and > > stunk? > > Hmm. Anyway it is hit by recent commit. I'll post a rebased > version and a version reverted to do hole-scan. Then I'll take > numbers as far as I can and will show the result.. tomorrow. I took performance numbers for master and three versions of the patch: LRU, full-scan, and modified full-scan. I noticed that the useless scan can be skipped in the full-scan version, so I added that last version. I ran three artificial test cases. The database is created by gen_tbl.pl. Numbers are the average of the fastest five runs out of 15 successive runs. The test cases are listed below. 1_0. About 3,000,000 negative entries are created in the pg_statistic cache by scanning that many distinct columns (3000 tables * 1001 columns). Pruning scans happen several times during a run but no entries are removed. This emulates the bloating phase of the cache. catalog_cache_prune_min_age is default (300s). (access_tbl1.pl) 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0, which means pruning is turned off. 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000 times. This emulates the stable cache case without pruning. 
catalog_cache_prune_min_age is default (300s). (access_tbl2.pl) 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0, which means pruning is turned off. 3_0. Scan over the 3,000,000 entries twice, with prune_age set to 10s. A run takes about 18 seconds on my box, so a fair amount of old entries are removed. This emulates the stable case with continuous pruning. (access_tbl3.pl) 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0, which means pruning is turned off. The results follow.
     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
 1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
 2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
 2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
 3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
 3_1 | 26.576 | 27.462 | 26.202 | 26.368 |
The results of 2_0 and 2_1 seem strange, but I show you the numbers as they stand for now. - Full-scan seems to have the smallest impact when turned off. - Full-scan-mod seems to perform best in total (assuming the Full-mod 2_0 number is wrong). - LRU doesn't seem to outperform full scanning. For your information, I measured how long pruning takes. LRU: 318318 out of 2097153 entries in 26ms, 0.08us/entry. Full-scan: 443443 out of 2097153 entries in 184ms, 0.4us/entry. LRU is actually faster at removing entries, but the difference seems to be canceled out by the complexity of LRU maintenance. My conclusion is that we should go with the Full-scan or Full-scan-mod version. I will conduct a further overnight test and see which is better. I attached the test script set. It is used in the following manner. (start server) # perl gen_tbl.pl | psql postgres (stop server) # sh run.sh 30 > log.txt # 30 is repeat count # perl process.pl
     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 16.711 | 17.647 | 16.767 | 17.256 |
 ...
The attached files are as follows. LRU version patches. 
LRU-0001-Add-dlist_move_tail.patch LRU-0002-Remove-entries-that-haven-t-been-used-for-a-certain-.patch Fullscn version patch. FullScan-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch Fullscn-mod version patch. FullScan-mod-0001-Remove-entries-that-haven-t-been-used-for-a-certain-.patch test scripts. test_script.tar.gz regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 1c397d118a65d6b76282cc904c43ecfe97ee5329 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 7 Feb 2019 14:56:07 +0900 Subject: [PATCH 1/2] Add dlist_move_tail We have dlist_push_head/tail and dlist_move_head but not dlist_move_tail. Add it. --- src/include/lib/ilist.h | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h index b1a5974ee4..659ab1ac87 100644 --- a/src/include/lib/ilist.h +++ b/src/include/lib/ilist.h @@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node) dlist_check(head); } +/* + * Move element from its current position in the list to the tail position in + * the same list. + * + * Undefined behaviour if 'node' is not already part of the list. + */ +static inline void +dlist_move_tail(dlist_head *head, dlist_node *node) +{ + /* fast path if it's already at the tail */ + if (head->head.prev == node) + return; + + dlist_delete(node); + dlist_push_tail(head, node); + + dlist_check(head); +} + /* * Check whether 'node' has a following node. * Caution: unreliable if 'node' is not in the list. -- 2.16.3 From f7a132c2b4910908773c508ef356a07cc853fe79 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH 2/2] Remove entries that haven't been used for a certain time Catcache entries happen to be left alone for several reasons. It is not desirable that such useless entries eat up memory. 
The catcache pruning feature removes entries that haven't been accessed for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 19 ++++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 122 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 18 ++++ 6 files changed, 171 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d383de2512..4231235447 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1677,6 +1677,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed. -1 indicates that this feature + is disabled entirely. The value defaults to 300 seconds (<literal>5 + minutes</literal>). Entries that are not used for that duration + can be removed to prevent the catalog cache from bloating with useless + entries. 
+ </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f9ce3d8f22..acab473d34 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2575,6 +2576,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..c8ee0c98fb 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,24 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + +/* + * Minimum interval between two successive moves of a cache entry in LRU list, + * in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -473,6 +489,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -833,6 +850,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. print some @@ -850,9 +868,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. It is based + * on the similar algorithm with buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + dlist_mutable_iter iter; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. catcacheclock is updated per-statement + * basis. 
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* + * There are no older entries, so exit. At least one removal + * prevents rehashing this time. + */ + break; + } + + /* + * Entries that have not been accessed since the last pruning are + * removed then; otherwise their lives are prolonged according to how + * many times they have been accessed, up to three times that duration. + * We don't try to shrink buckets since pruning effectively caps + * catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1264,6 +1356,20 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* prolong life of this entry */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * Don't update the LRU too frequently. We need to maintain the LRU even + * if pruning is inactive since it can be turned on mid-session. + */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. 
@@ -1888,19 +1994,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try removing infrequently used + * entries to make room for the new entry. If that fails, enlarge the + * bucket array instead. Quite arbitrarily, we try this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index aa564d153a..e624c74bf9 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index cccb5f145a..fa117f0573 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..a21c53644a 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +194,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 0a6691078f8af5f35cca194137768af7d08fa1d8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH] Remove entries that haven't been used for a certain time Catcache entries happen to be left alone for several reasons. It is not desirable that such useless entries eat up memory. The catcache pruning feature removes entries that haven't been accessed for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 19 +++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 105 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 16 ++++ 6 files changed, 152 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d383de2512..4231235447 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1677,6 +1677,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed.
-1 indicates that this feature + is disabled at all. The value defaults to 300 seconds (<literal>5 + minutes</literal>). The entries that are not used for the duration + can be removed to prevent catalog cache from bloating with useless + entries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f9ce3d8f22..acab473d34 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2575,6 +2576,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..52586bd415 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -850,9 +860,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if it hasn't been done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent the catcache from bloating. It is + * based on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the last access to the "current" + * time. catcacheclock is updated on a per-statement basis and + * additionally updated periodically during a long-running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + continue; + + /* + * Entries that have not been accessed since the last pruning are + * removed at this point; an entry's life is prolonged, according + * to how many times it has been accessed, by up to three times + * the duration.
We don't try to + * shrink buckets since pruning effectively caps catcache + * expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1264,6 +1348,12 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* prolong the life of this entry */ + if (ct->naccess < 2) + ct->naccess++; + + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1888,19 +1978,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try removing infrequently used + * entries to make room for the new entry. If that fails, enlarge the bucket + * array instead. Quite arbitrarily, we try this when fill factor > 2.
*/ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index aa564d153a..e624c74bf9 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index cccb5f145a..fa117f0573 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..2134839ecf 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From f2379fb8070420ea0880cfa74439744ade41dc3f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH] Remove entries that haven't been used for a certain time Catcache entries happen to be left alone for several reasons. It is not desirable that such useless entries eat up memory. The catcache pruning feature removes entries that haven't been accessed for a certain time, before enlarging the hash array. --- doc/src/sgml/config.sgml | 19 ++++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 121 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 17 ++++ 6 files changed, 169 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index d383de2512..4231235447 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1677,6 +1677,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed.
-1 indicates that this feature + is disabled at all. The value defaults to 300 seconds (<literal>5 + minutes</literal>). The entries that are not used for the duration + can be removed to prevent catalog cache from bloating with useless + entries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f9ce3d8f22..acab473d34 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2575,6 +2576,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. */ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..c4582fe5a3 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -850,9 +860,99 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize the catcache reference clock if it hasn't been done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent the catcache from bloating. It is + * based on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a + * per-statement basis and additionally updated periodically + * during a long-running query.
+ */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age >= catalog_cache_prune_min_age) + { + /* + * Entries that have not been accessed since the last + * pruning are removed at this point; an entry's life is + * prolonged, according to how many times it has been + * accessed, by up to three times the duration. We don't + * try to shrink buckets since pruning effectively caps + * catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't let a removed entry update oldest_ts */ + continue; + } + } + } + + /* update the oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1264,6 +1364,12 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* prolong the life of this entry */ + if (ct->naccess < 2) + ct->naccess++; + + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1888,19 +1994,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2.
+ * If the hash table has become too full, try removing infrequently used + * entries to make room for the new entry. If that fails, enlarge the bucket + * array instead. Quite arbitrarily, we try this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index aa564d153a..e624c74bf9 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2202,6 +2203,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index cccb5f145a..fa117f0573 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..2f697d5ca4 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + long cc_oldest_ts; /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3
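[Editor's note: for readers comparing the variants above, the aging policy shared by the posted patches can be condensed into a small standalone sketch. The types and function names below are hypothetical simplifications — the real code works on CatCTup, uses TimestampTz and TimestampDifference(), and also skips entries whose refcount is nonzero. Note, too, that the posted code disables pruning when catalog_cache_prune_min_age is 0 while the GUC documentation says -1 turns it off; this sketch follows the documented -1 semantics.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the catcache aging policy (hypothetical simplified types).
 * Each entry carries up to two extra "lives": naccess is bumped on every
 * cache hit (capped at 2) and spent one at a time by pruning passes that
 * find the entry older than min_age, so a recently busy entry survives up
 * to three pruning passes after its last access.
 */
typedef struct SketchEntry
{
	int			naccess;	/* extra lives, 0..2 */
	long		lastaccess;	/* last access time, in seconds */
} SketchEntry;

static void
sketch_on_access(SketchEntry *e, long now)
{
	if (e->naccess < 2)
		e->naccess++;
	e->lastaccess = now;
}

/* Returns true when a pruning pass should remove the entry. */
static bool
sketch_prune_one(SketchEntry *e, long now, long min_age)
{
	if (min_age < 0 || now - e->lastaccess < min_age)
		return false;		/* pruning disabled, or entry too young */
	if (e->naccess > 0)
	{
		e->naccess--;		/* spend a life instead of removing */
		return false;
	}
	return true;
}
```

With min_age = 300, an entry hit at least twice is only removed on the third pruning pass that finds it more than 300 seconds idle.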
At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
> I ran three artificial test cases. The database is created by
> gen_tbl.pl. Numbers are the average of the fastest five runs in
> 15 successive runs.
>
> Test cases are listed below.
>
> 1_0. About 3,000,000 negative entries are created in the pg_statistic
> cache by scanning that many distinct columns. It is 3000 tables
> * 1001 columns. Pruning scans happen several times during a run
> but no entries are removed. This emulates the bloating phase of
> the cache. catalog_cache_prune_min_age is default (300s).
> (access_tbl1.pl)
>
> 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
> which means pruning is turned off.
>
> 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
> times. This emulates the stable cache case without
> pruning. catalog_cache_prune_min_age is default (300s).
> (access_tbl2.pl)
>
> 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
> which means pruning is turned off.
>
> 3_0. Scan over the 3,000,000 entries twice with prune_age set
> to 10s. A run takes about 18 seconds on my box, so a fair amount
> of old entries are removed. This emulates the stable case with
> continuous pruning. (access_tbl3.pl)
>
> 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
> which means pruning is turned off.
>
>
> The result follows.
>
>      | master |  LRU   |  Full  |Full-mod|
> -----|--------+--------+--------+--------+
>  1_0 | 17.287 | 17.370 | 17.255 | 16.623 |
>  1_1 | 17.287 | 17.063 | 16.336 | 17.192 |
>  2_0 | 15.695 | 18.769 | 18.563 | 15.527 |
>  2_1 | 15.695 | 18.603 | 18.498 | 18.487 |
>  3_0 | 26.576 | 33.817 | 34.384 | 34.971 |
>  3_1 | 26.576 | 27.462 | 26.202 | 26.368 |
>
> The results for 2_0 and 2_1 seem strange, but I show you the
> numbers as they stand for now.
>
> - Full-scan seems to have the smallest impact when turned off.
> - Full-scan-mod seems to perform best in total (assuming the
> Full-mod 2_0 number is a wrong value..)
>
> - LRU doesn't seem to outperform full scanning.

I had another.. unstable.. result.

     | master |  LRU   |  Full  |Full-mod|
-----|--------+--------+--------+--------+
 1_0 | 16.312 | 16.540 | 16.482 | 16.348 |
 1_1 | 16.312 | 16.454 | 16.335 | 16.232 |
 2_0 | 16.710 | 16.954 | 17.873 | 17.345 |
 2_1 | 16.710 | 17.373 | 18.499 | 17.563 |
 3_0 | 25.010 | 33.031 | 33.452 | 33.937 |
 3_1 | 25.010 | 24.784 | 24.570 | 25.453 |

Normalizing to master's result and rounding to the nearest 1%, it looks like this:

     | master |  LRU   |  Full  |Full-mod| Test description
-----|--------+--------+--------+--------+-----------------------------------
 1_0 |  100   |  101   |  101   |  100   | bloating. pruning enabled.
 1_1 |  100   |  101   |  100   |  100   | bloating. pruning disabled.
 2_0 |  100   |  101   |  107   |  104   | normal access. pruning enabled.
 2_1 |  100   |  104   |  111   |  105   | normal access. pruning disabled.
 3_0 |  100   |  132   |  134   |  136   | pruning continuously running.
 3_1 |  100   |   99   |   98   |  102   | pruning disabled.

I'm not sure why 2_1 is slower than 2_0, but LRU has the least impact if the numbers are right. I will investigate the strange behavior using a profiler.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
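[Editor's note: the main difference between the "Full" and "Full-mod" variants benchmarked above is a single shortcut — the cache remembers the oldest lastaccess among surviving entries (cc_oldest_ts in the second patch), so a pruning call can skip the whole hash scan when even the oldest entry is younger than the threshold. A minimal sketch with hypothetical simplified types follows; it omits the refcount checks and the naccess "extra lives" that the real patches have.]

```c
#include <assert.h>

/*
 * Sketch of the "Full-mod" skip check (hypothetical simplified types; the
 * real code works on CatCTup hash buckets).  sketch_prune() first compares
 * the cached oldest timestamp against the threshold and returns without
 * scanning when nothing can be old enough; otherwise it scans, removes
 * stale entries, and re-derives oldest_ts from the survivors.
 */
typedef struct SketchCache
{
	long	   *lastaccess;	/* per-entry last-access times, in seconds */
	int			nentries;
	long		oldest_ts;	/* min(lastaccess) after the previous scan */
} SketchCache;

/* Returns the number of entries a pruning pass removes. */
static int
sketch_prune(SketchCache *cp, long now, long min_age)
{
	int			nremoved = 0;
	int			i;
	int			w = 0;
	long		oldest = now;

	if (now - cp->oldest_ts < min_age)
		return 0;				/* cheap skip: nothing can be old enough */

	for (i = 0; i < cp->nentries; i++)
	{
		if (now - cp->lastaccess[i] >= min_age)
		{
			nremoved++;			/* drop the entry (sketch: compact the array) */
			continue;			/* removed entries don't feed oldest_ts */
		}
		if (cp->lastaccess[i] < oldest)
			oldest = cp->lastaccess[i];	/* survivors feed the next skip check */
		cp->lastaccess[w++] = cp->lastaccess[i];
	}
	cp->nentries = w;
	cp->oldest_ts = oldest;
	return nremoved;
}
```

The shortcut trades one extra long per cache (and one comparison per call) for avoiding repeated full scans on caches that hold only recently used entries, which is the common steady state.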
At Mon, 01 Apr 2019 11:05:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190401.110532.102998353.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 29 Mar 2019 17:24:40 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190329.172440.199616830.horiguchi.kyotaro@lab.ntt.co.jp>
> > I ran three artificial test cases. The database is created by
> > gen_tbl.pl. Numbers are the average of the fastest five runs in
> > 15 successive runs.
> >
> > Test cases are listed below.
> >
> > 1_0. About 3,000,000 negative entries are created in the pg_statistic
> > cache by scanning that many distinct columns. It is 3000 tables
> > * 1001 columns. Pruning scans happen several times during a run
> > but no entries are removed. This emulates the bloating phase of
> > the cache. catalog_cache_prune_min_age is default (300s).
> > (access_tbl1.pl)
> >
> > 1_1. Same as 1_0 except that catalog_cache_prune_min_age is 0,
> > which means pruning is turned off.
> >
> > 2_0. Repeatedly access 1001 of the 3,000,000 entries 6000
> > times. This emulates the stable cache case without
> > pruning. catalog_cache_prune_min_age is default (300s).
> > (access_tbl2.pl)
> >
> > 2_1. Same as 2_0 except that catalog_cache_prune_min_age is 0,
> > which means pruning is turned off.
> >
> > 3_0. Scan over the 3,000,000 entries twice with prune_age set
> > to 10s. A run takes about 18 seconds on my box, so a fair amount
> > of old entries are removed. This emulates the stable case with
> > continuous pruning. (access_tbl3.pl)
> >
> > 3_1. Same as 3_0 except that catalog_cache_prune_min_age is 0,
> > which means pruning is turned off.
..
> I had another.. unstable.. result.

dlist_move_head is used every time an entry is accessed. It moves the accessed element to the head of its bucket, expecting that subsequent accesses become faster - a kind of LRU maintenance. But the mean length of a bucket is 2, so dlist_move_head costs more than following one extra link.
So I removed it in the pruning patch.

I understand I cannot get rid of the noise as far as I'm poking the feature from a client via the communication and SQL layers. The attached extension surgically exercises SearchSysCache3(STATRELATTINH) in almost the same pattern as the benchmarks taken last week. I believe that gives far more reliable numbers. But still the numbers fluctuate by up to about 10% every trial, and the difference among the methods is within that fluctuation. I'm tired.. But this still looks somewhat wrong.

"ratio" in the following table is the percentage relative to master for the same test. master2 is a version with the dlist_move_head call removed from master.

 binary  | test | count |   avg   | stddev | ratio
---------+------+-------+---------+--------+--------
 master  | 1_0  |     5 | 7841.42 |   6.91 |
 master  | 2_0  |     5 | 3810.10 |   8.51 |
 master  | 3_0  |     5 | 7826.17 |  11.98 |
 master  | 1_1  |     5 | 7905.73 |   5.69 |
 master  | 2_1  |     5 | 3827.15 |   5.55 |
 master  | 3_1  |     5 | 7822.67 |  13.75 |
---------+------+-------+---------+--------+--------
 master2 | 1_0  |     5 | 7538.05 |  16.65 |  96.13
 master2 | 2_0  |     5 | 3927.05 |  11.58 | 103.07
 master2 | 3_0  |     5 | 7455.47 |  12.03 |  95.26
 master2 | 1_1  |     5 | 7485.60 |   9.38 |  94.69
 master2 | 2_1  |     5 | 3870.81 |   5.54 | 101.14
 master2 | 3_1  |     5 | 7437.35 |   9.91 |  95.74
---------+------+-------+---------+--------+--------
 LRU     | 1_0  |     5 | 7633.57 |   9.00 |  97.35
 LRU     | 2_0  |     5 | 4062.43 |   5.90 | 106.62
 LRU     | 3_0  |     5 | 8340.51 |   6.12 | 106.57
 LRU     | 1_1  |     5 | 7645.87 |  13.29 |  96.71
 LRU     | 2_1  |     5 | 4026.60 |   7.56 | 105.21
 LRU     | 3_1  |     5 | 8400.10 |  19.07 | 107.38
---------+------+-------+---------+--------+--------
 Full    | 1_0  |     5 | 7481.61 |   6.70 |  95.41
 Full    | 2_0  |     5 | 4084.46 |  14.50 | 107.20
 Full    | 3_0  |     5 | 8166.23 |  14.80 | 104.35
 Full    | 1_1  |     5 | 7447.20 |  10.93 |  94.20
 Full    | 2_1  |     5 | 4016.88 |   8.53 | 104.96
 Full    | 3_1  |     5 | 8258.80 |   7.91 | 105.58
---------+------+-------+---------+--------+--------
 FullMod | 1_0  |     5 | 7291.80 |  14.03 |  92.99
 FullMod | 2_0  |     5 | 4006.36 |   7.64 | 105.15
 FullMod | 3_0  |     5 | 8143.60 |   9.26 | 104.06
 FullMod | 1_1  |     5 | 7270.66 |   6.24 |  91.97
 FullMod | 2_1  |     5 | 3996.20 |  13.00 | 104.42
 FullMod | 3_1  |     5 | 8012.55 |   7.09 | 102.43

So "Full (scan) Mod" wins again, or the difference is within the margin of error. I don't think this level of difference can be a reason to reject this kind of resource-saving mechanism. The LRU version doesn't seem particularly slow, but it also doesn't seem particularly fast given its complexity. The FullMod version doesn't look different either. So it seems to me that the simplest "Full" version wins.

The attached is the rebased version. dlist_move_head(entry) is removed in that patch, as mentioned above. The third and fourth attachments are the set of scripts I used.

$ perl gen_tbl.pl | psql postgres
$ run.sh > log.txt

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 57c9dab7fff7b81890657594711bbfb47a3e0f0d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH 1/2] Remove entries that haven't been used for a certain time Catcache entries happen to be left alone for several reasons. It is not desirable that such useless entries eat up memory. The catcache pruning feature removes entries that haven't been accessed for a certain time, before enlarging the hash array.
--- doc/src/sgml/config.sgml | 19 ++++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 124 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 18 ++++ 6 files changed, 172 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc1d0f7bfa..819b252029 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1677,6 +1677,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed. -1 indicates that this feature + is disabled at all. The value defaults to 300 seconds (<literal>5 + minutes</literal>). The entries that are not used for the duration + can be removed to prevent catalog cache from bloating with useless + entries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 44a59e1d4f..a0efac86bc 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2577,6 +2578,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. 
*/ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..e85f2b038c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,24 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + +/* + * Minimum interval between two successive moves of a cache entry in LRU list, + * in microseconds. + */ +#define MIN_LRU_UPDATE_INTERVAL 100000 /* 100ms */ + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -473,6 +489,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) /* delink from linked list */ dlist_delete(&ct->cache_elem); + dlist_delete(&ct->lru_node); /* * Free keys when we're dealing with a negative entry, normal entries just @@ -833,6 +850,7 @@ InitCatCache(int id, cp->cc_nkeys = nkeys; for (i = 0; i < nkeys; ++i) cp->cc_keyno[i] = key[i]; + dlist_init(&cp->cc_lru_list); /* * new cache is initialized as far as we can go for now. 
print some @@ -850,9 +868,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. It is based + * on the similar algorithm with buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + dlist_mutable_iter iter; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Scan over LRU to find entries to remove */ + dlist_foreach_modify(iter, &cp->cc_lru_list) + { + CatCTup *ct = dlist_container(CatCTup, lru_node, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. catcacheclock is updated per-statement + * basis. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + { + /* + * We don't have older entries, exit. At least one removal + * prevents rehashing this time. + */ + break; + } + + /* + * Entries that are not accessed after the last pruning are removed in + * that seconds, and their lives are prolonged according to how many + * times they are accessed up to three times of the duration. We don't + * try shrink buckets since pruning effectively caps catcache + * expansion in the long term. 
+ */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1262,7 +1354,21 @@ SearchCatCacheInternal(CatCache *cache, * most frequently accessed elements in any hashbucket will tend to be * near the front of the hashbucket's list.) */ - dlist_move_head(bucket, &ct->cache_elem); + /* dlist_move_head(bucket, &ct->cache_elem);*/ + + /* prolong life of this entry */ + if (ct->naccess < 2) + ct->naccess++; + + /* + * Don't update LRU too frequently. We need to maintain the LRU even + * if pruning is inactive since it can be turned on on-session. + */ + if (catcacheclock - ct->lastaccess > MIN_LRU_UPDATE_INTERVAL) + { + ct->lastaccess = catcacheclock; + dlist_move_tail(&cache->cc_lru_list, &ct->lru_node); + } /* * If it's a positive entry, bump its refcount and return it. If it's @@ -1888,19 +1994,29 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; + dlist_push_tail(&cache->cc_lru_list, &ct->lru_node); dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try removing infrequently used + * entries to make a room for the new entry. If failed, enlarge the bucket + * array instead. Quite arbitrarily, we try this when fill factor > 2. 
*/ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1766e46037..e671d4428e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2249,6 +2250,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this number of seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index bbbeb4bb15..d88ec57382 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..a21c53644a 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + dlist_head cc_lru_list; /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,9 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ + dlist_node lru_node; /* LRU node */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +194,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From ac4a9dc1bb822f9df36d453354b953d2b383545d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Thu, 4 Apr 2019 21:16:17 +0900 Subject: [PATCH 2/2] Benchmark extension for catcache pruning feature. This extension surgically exercises CatCacheSearch() on STATRELATTINH and returns the duration in milliseconds. --- contrib/catcachebench/Makefile | 17 ++ contrib/catcachebench/catcachebench--0.0.sql | 9 ++ contrib/catcachebench/catcachebench.c | 229 +++++++++++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + 4 files changed, 261 insertions(+) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../..
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..e091baaaa7 --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,9 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..36d21d13c1 --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,229 @@ +/* + * catcachebench: test code for cache pruning feature + */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "utils/catcache.h" +#include "utils/syscache.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. 
*/ + CatalogCacheFlushCatalog(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 60000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +double +catcachebench3(void) +{ + int i, t, a; + instr_time start, + duration; + + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 2 ; i++) + { + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true -- 2.16.3 #! /usr/bin/perl $collist = ""; foreach $i (0..1000) { $collist .= sprintf(", c%05d int", $i); } $collist = substr($collist, 2); printf "drop schema if exists test cascade;\n"; printf "create schema test;\n"; foreach $i (0..2999) { printf "create table test.t%04d ($collist);\n", $i; } #!/bin/bash LOOPS=5 BINROOT=/home/horiguti/bin DATADIR=/home/horiguti/data/data_work_o2 PREC="numeric(10,2)" killall postgres sleep 3 run() { local BINARY=$1 local PGCTL=$2/bin/pg_ctl if [ "$3" != "" ]; then local SETTING1="set catalog_cache_prune_min_age to \"$3\";" local SETTING2="set catalog_cache_prune_min_age to \"$4\";" local SETTING3="set catalog_cache_prune_min_age to \"$5\";" fi $PGCTL --pgdata=$DATADIR start psql postgres -e <<EOF create extension if not exists catcachebench; select catcachebench(0); $SETTING1 select '${BINARY}' as binary, '1_0' as test, count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(1)from generate_series(1, ${LOOPS})) as a(a) UNION ALL select '${BINARY}', '2_0' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(2) from generate_series(1,${LOOPS})) as a(a); $SETTING2 select '${BINARY}' as binary, '3_0' as test, count(a), avg(a)::${PREC}, 
stddev(a)::${PREC} from (select catcachebench(3)from generate_series(1, ${LOOPS})) as a(a); $SETTING3 select '${BINARY}' as binary, '1_1' as test, count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(1)from generate_series(1, ${LOOPS})) as a(a) UNION ALL select '${BINARY}', '2_1' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(2) from generate_series(1,${LOOPS})) as a(a) UNION ALL select '${BINARY}', '3_1' , count(a), avg(a)::${PREC}, stddev(a)::${PREC} from (select catcachebench(3) from generate_series(1,${LOOPS})) as a(a); EOF $PGCTL --pgdata=$DATADIR stop } run "master" $BINROOT/pgsql_work_o2 "" "" "" run "master2" $BINROOT/pgsql_mater_o2m "" "" "" run "LRU" $BINROOT/pgsql_catexp8_1 "300s" "1s" "0" run "Full" $BINROOT/pgsql_catexp8_2 "300s" "1s" "0" run "FullMod" $BINROOT/pgsql_catexp8_3 "300s" "1s" "0"
On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> So it seems to me that the simplest "Full" version wins. The
> attached is the rebased version. dlist_move_head(entry) is removed as
> mentioned above in that patch.

1. I really don't think this patch has any business changing the existing logic. You can't just assume that the dlist_move_head() operation is unimportant for performance.

2. This patch still seems to add a new LRU list that has to be maintained. That's fairly puzzling. You seem to have concluded that the version without the additional LRU wins, but then sent a new copy of the version with the LRU list.

3. I don't think adding an additional call to GetCurrentTimestamp() in start_xact_command() is likely to be acceptable. There has got to be a way to set this up so that the maximum number of new GetCurrentTimestamp() calls is limited to once per N seconds, vs. the current implementation that could do it many, many times per second.

4. The code in CatalogCacheCreateEntry seems clearly unacceptable. In a pathological case where CatCacheCleanupOldEntries removes exactly one element per cycle, it could be called on every new catcache allocation.

I think we need to punt this patch to the next release. We're not converging on anything committable very fast.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you for the comment.

At Thu, 4 Apr 2019 15:44:35 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoZQx7pCcc=VO3WeDQNpco8h6MZN09KjcOMRRu_CrbeoSw@mail.gmail.com>

> On Thu, Apr 4, 2019 at 8:53 AM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > So it seems to me that the simplest "Full" version wins. The
> > attached is the rebased version. dlist_move_head(entry) is removed as
> > mentioned above in that patch.
>
> 1. I really don't think this patch has any business changing the
> existing logic. You can't just assume that the dlist_move_head()
> operation is unimportant for performance.

OK; it doesn't show a significant performance gain, so I removed that.

> 2. This patch still seems to add a new LRU list that has to be
> maintained. That's fairly puzzling. You seem to have concluded that
> the version without the additional LRU wins, but then sent a new copy
> of the version with the LRU list.

Sorry, I attached the wrong one. The attached is the right one, which doesn't add the new dlist.

> 3. I don't think adding an additional call to GetCurrentTimestamp() in
> start_xact_command() is likely to be acceptable. There has got to be
> a way to set this up so that the maximum number of new
> GetCurrentTimestamp() calls is limited to once per N seconds, vs. the
> current implementation that could do it many, many times per second.

GetCurrentTimestamp() is called only once, very early in the backend's life, in InitPostgres, not in start_xact_command. What I did in that function is just copy stmtStartTimestamp, not call GetCurrentTimestamp().

> 4. The code in CatalogCacheCreateEntry seems clearly unacceptable. In
> a pathological case where CatCacheCleanupOldEntries removes exactly
> one element per cycle, it could be called on every new catcache
> allocation.
It could be a problem if just one entry were created over a span longer than catalog_cache_prune_min_age plus the resize interval, or if all candidate entries but one were actually in use at the moment of pruning. Is that realistic?

> I think we need to punt this patch to the next release. We're not
> converging on anything committable very fast.

Yeah, maybe you are right. This patch has gone silent for several months at a time, received comments and been revised accordingly for more than two cycles, and finally got a death sentence (not literally; actually a postponement) very close to the end of this third cycle. I anticipate the same will continue in the next cycle.

By the way, I found the reason for the wrong result in the previous benchmark: the 3_0/1 tests need to update catcacheclock in the middle of the loop. I'm going to fix that and rerun.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center

From 596d6b018e1b7ddd5828298bfaba3ee405eb2604 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 1 Mar 2019 13:32:51 +0900 Subject: [PATCH] Remove entries that haven't been used for a certain time

Catcache entries can be left unused for several reasons, and it is not desirable that such useless entries eat up memory. The catcache pruning feature removes entries that haven't been accessed for a certain time before enlarging the hash array.
--- doc/src/sgml/config.sgml | 19 +++++ src/backend/tcop/postgres.c | 2 + src/backend/utils/cache/catcache.c | 103 +++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/utils/catcache.h | 16 ++++ 6 files changed, 150 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc1d0f7bfa..819b252029 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1677,6 +1677,25 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-catalog-cache-prune-min-age" xreflabel="catalog_cache_prune_min_age"> + <term><varname>catalog_cache_prune_min_age</varname> (<type>integer</type>) + <indexterm> + <primary><varname>catalog_cache_prune_min_age</varname> configuration + parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the minimum amount of unused time in seconds at which a + system catalog cache entry is removed. -1 indicates that this feature + is disabled at all. The value defaults to 300 seconds (<literal>5 + minutes</literal>). The entries that are not used for the duration + can be removed to prevent catalog cache from bloating with useless + entries. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-max-stack-depth" xreflabel="max_stack_depth"> <term><varname>max_stack_depth</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 44a59e1d4f..a0efac86bc 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" @@ -2577,6 +2578,7 @@ start_xact_command(void) * not desired, the timeout has to be disabled explicitly. 
*/ enable_statement_timeout(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } static void diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..03c2d8524c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timeout.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -850,9 +860,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. It is based + * on the similar algorithm with buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. 
+ */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the time from the last access to + * the "current" time. catcacheclock is updated per-statement + * basis and additionaly udpated periodically during a long + * running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + continue; + + /* + * Entries that are not accessed after the last pruning are + * removed in that seconds, and their lives are prolonged + * according to how many times they are accessed up to three times + * of the duration. We don't try shrink buckets since pruning + * effectively caps catcache expansion in the long term. + */ + if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1263,6 +1347,10 @@ SearchCatCacheInternal(CatCache *cache, * near the front of the hashbucket's list.) */ dlist_move_head(bucket, &ct->cache_elem); + if (ct->naccess < 2) + ct->naccess++; + + ct->lastaccess = catcacheclock; /* * If it's a positive entry, bump its refcount and return it. 
If it's @@ -1888,19 +1976,28 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); cache->cc_ntup++; CacheHdr->ch_ntup++; + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. + * If the hash table has become too full, try removing infrequently used + * entries to make a room for the new entry. If failed, enlarge the bucket + * array instead. Quite arbitrarily, we try this when fill factor > 2. */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) + if (cache->cc_ntup > cache->cc_nbuckets * 2 && + !CatCacheCleanupOldEntries(cache)) RehashCatCache(cache); + ct->refcount--; + return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 1766e46037..e671d4428e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2249,6 +2250,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this seconds are considered forremoval."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. 
InitializeGUCOptions will increase it if diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index bbbeb4bb15..d88ec57382 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -128,6 +128,7 @@ #work_mem = 4MB # min 64kB #maintenance_work_mem = 64MB # min 1MB #autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem +#catalog_cache_prune_min_age = 300s # -1 disables pruning #max_stack_depth = 2MB # min 100kB #shared_memory_type = mmap # the default is the first option # supported by the operating system: diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..2134839ecf 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of access to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3
At Fri, 05 Apr 2019 09:44:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190405.094407.151644324.horiguchi.kyotaro@lab.ntt.co.jp>
> By the way, I found the reason for the wrong result of the
> previous benchmark. Test 3_0/1 needs to update catcacheclock
> in the middle of the loop. I'm going to fix it and rerun it.

I found the cause. CatalogCacheFlushCatalog() doesn't shrink the hash, so no resize happens once it has bloated. I needed another version of the function that resets cc_bucket to its initial size. Using the new debug function, I got better numbers. I focused on the performance when the feature is disabled. I rechecked it by adding the patch part by part and identified several causes of the degradation. I did:

- Moved SetCatCacheClock() to AtStart_Cache()
- Maybe improved the caller site of CatCacheCleanupOldEntries().

As a result:

binary | test | count | avg     | stddev |
--------+------+-------+---------+--------+-------
master | 1_1  | 5     | 7104.90 | 4.40   |
master | 2_1  | 5     | 3759.26 | 4.20   |
master | 3_1  | 5     | 7954.05 | 2.15   |
--------+------+-------+---------+--------+-------
Full   | 1_1  | 5     | 7237.20 | 7.98   | 101.87
Full   | 2_1  | 5     | 4050.98 | 8.42   | 107.76
Full   | 3_1  | 5     | 8192.87 | 3.28   | 103.00

But it still fluctuates by around 5%. If this level of degradation is still not acceptable, that means nothing can be inserted in this code path, and the new code path should be isolated from the existing code by using an indirect call. regards.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index bd5024ef00..a9414c0c07 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1067,6 +1067,7 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + SetCatCacheClock(GetCurrentStatementStartTimestamp()); } /* diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 44a59e1d4f..4d849aeb4c 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -71,6 +71,7 @@ #include "tcop/pquery.h" #include "tcop/tcopprot.h" #include "tcop/utility.h" +#include "utils/catcache.h" #include "utils/lsyscache.h" #include "utils/memutils.h" #include "utils/ps_status.h" diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d05930bc4c..91814f7210 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -60,9 +60,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 0; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -850,9 +859,83 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. 
It is based + * on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that had fewer + * accesses in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age == 0) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + long entry_age; + int us; + + /* Don't remove referenced entries */ + if (ct->refcount != 0 || + (ct->c_list && ct->c_list->refcount != 0)) + continue; + + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a per-statement + * basis and additionally updated periodically during a long + * running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &entry_age, &us); + + if (entry_age < catalog_cache_prune_min_age) + continue; + + /* + * Entries that are not accessed after the last pruning are + * removed after that many seconds, and their lives are prolonged + * according to how many times they are accessed, up to three + * times that duration. We don't try to shrink buckets since + * pruning effectively caps catcache expansion in the long term. + */ + if (ct->naccess > 1) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + } + } + } + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -1263,6 +1346,10 @@ SearchCatCacheInternal(CatCache *cache, * near the front of the hashbucket's list.)
*/ dlist_move_head(bucket, &ct->cache_elem); + ct->naccess++; + ct->naccess &= 3; + + ct->lastaccess = catcacheclock; /* * If it's a positive entry, bump its refcount and return it. If it's @@ -1888,6 +1975,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1895,11 +1984,25 @@ CacheHdr->ch_ntup++; /* - * If the hash table has become too full, enlarge the buckets array. Quite - * arbitrarily, we enlarge when fill factor > 2. - */ - if (cache->cc_ntup > cache->cc_nbuckets * 2) - RehashCatCache(cache); + * If the hash table has become too full, try removing infrequently used + * entries to make room for the new entry. If that fails, enlarge the bucket + * array instead. Quite arbitrarily, we try this when fill factor > 2. + */ + if (unlikely(cache->cc_ntup > cache->cc_nbuckets * 2)) + { + bool rehash = true; + + if (unlikely(catalog_cache_prune_min_age > 0)) + { + /* increase refcount so that the new entry survives pruning */ + ct->refcount++; + rehash = !CatCacheCleanupOldEntries(cache); + ct->refcount--; + } + + if (likely(rehash)) + RehashCatCache(cache); + } return ct; } diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 65d816a583..871f51fe34 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -119,6 +120,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry?
*/ HeapTupleData tuple; /* tuple management header */ + int naccess; /* # of accesses to this entry, up to 2 */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -185,6 +188,18 @@ typedef struct catcacheheader int ch_ntup; /* # of tuples in all caches */ } CatCacheHeader; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext;
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Does this result show that the hard-limit size option with memory accounting doesn't harm
>usual users who disable the hard-limit size option?

Hi, I've implemented relation cache size limitation with an LRU list and built-in memory context size accounting, and I'll share some results along with a quick recap of the catcache work so that we can resume the discussion if needed. Relation cache bloat was also discussed in this thread, but it is currently pending and the catcache feature is not yet settled, so a variety of information should be useful.

Regarding the catcache, Horiguchi-san's recent posts show fairly detailed stats, including a comparison of LRU overhead against a full scan of the hash table. According to those results the LRU overhead seems small, but for simplicity this thread goes without LRU. https://www.postgresql.org/message-id/20190404.215255.09756748.horiguchi.kyotaro%40lab.ntt.co.jp

When the catcache had a hard limit, there was built-in memory context size accounting machinery. I checked the overhead of that accounting: repeating palloc and pfree of an 800-byte area many times was 4% slower, while with a 32768-byte area there was no visible overhead. https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04

Regarding the relcache hard limit (relation_cache_max_size), most of the architecture is the same as the catcache one with an LRU list, except for memory accounting. Relcaches are managed by an LRU list. To prune the LRU cache, we need to know the overall relcache size, including objects pointed to by the relcache such as 'index info'. So in this patch relcache objects are allocated under RelCacheMemoryContext, which is a child of CacheMemoryContext, and objects pointed to by a relcache are allocated under child contexts of RelCacheMemoryContext. With the built-in size accounting, if a memory context is set to collect the "group (family) size", you can easily get the context size including its children.
I ran two experiments:

A) One is pgbench using the script Tomas posted a while ago, which randomly selects 1 from many tables. https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F426207%40G01JPEXMBKW04
B) The other checks memory context accounting overhead using the same method as before. https://www.postgresql.org/message-id/4E72940DA2BF16479384A86D54D0988A6F44564E%40G01JPEXMBKW04

A) randomly select 1 from many tables. Results are the average of 5 runs each.

number of tables               | 100         | 1000        | 10000
-------------------------------+-------------+-------------+-------------
TPS (master)                   | 11105       | 10815       | 8915
TPS (patch; limit feature off) | 11254 (+1%) | 11176 (+3%) | 9242 (+4%)
TPS (patch; limit on with 1MB) | 11317 (+2%) | 10491 (-3%) | 7380 (-17%)

The results are noisy, but the overhead of LRU and memory accounting seems small when the relcache limit feature is turned off. With the limit feature on, TPS drops 17% after the limit is exceeded, which is no surprise.

B) Repeat palloc/pfree. "With group accounting" means accounting the test context and its child contexts with the built-in accounting using palloc_bench_family(); the other case uses palloc_bench(). Please see palloc_bench.gz.

[Size=32768, iter=1,000,000]
Master                    | 59.97 ms
Master with group account | 59.57 ms
patched                   | 67.23 ms
patched with family       | 68.81 ms

The overhead seems large in this patch, so this area needs more inspection.

regards, Takeshi Ideriha
Attachment
Hello,

my_gripe> But, still fluctuates by around 5%..
my_gripe>
my_gripe> If this level of the degradation is still not acceptable, that
my_gripe> means that nothing can be inserted in the code path and the new
my_gripe> code path should be isolated from existing code by using indirect
my_gripe> call.

Finally, after some struggling, I think I managed to measure the performance impact precisely and reliably. Starting from "make distclean" on every build, then removing everything in $TARGET before installation, makes things stable enough. (I don't think it's good, but I didn't investigate the cause.)

I measured time per call by directly calling SearchSysCache3() many times. It showed that the patch causes around 0.1 microseconds of degradation per call. (The function overall took about 6.9 microseconds on average.)

Next, I counted how many times SearchSysCache is called while planning, using as an instance a query on a partitioned table having 3000 columns and 1000 partitions: explain analyze select sum(c0000) from test.p; The planner made 6020608 syscache calls while planning, and the overall planning time was 8641 ms. (Execution time was 48 ms.) 6020608 times 0.1 us is 602 ms of degradation, so roughly a 7% estimated degradation in planning time.

The degradation was caused by really only two successive instructions, "ADD/conditional MOVE (CMOVE)". That fact leads to the conclusion that the existing code path, as is, has no room for any additional code. So I sought room for at least one branch and found some (on gcc 7.3.1/CentOS 7/x64). Interestingly, de-inlining SearchCatCacheInternal gave a performance gain of about 3%, and further inlining of CatalogCacheComputeHashValue() gave another gain of about 3%. I could add a branch in SearchCatCacheInternal within that gain. I also tried indirect calls, but the degradation overwhelmed the gain, so I chose branching rather than indirect calls. I didn't investigate why that happens. The following is the result.
The binaries are built with the same configuration using -O2.

binary means
  master      : master HEAD.
  patched_off : patched, but pruning disabled (catalog_cache_prune_min_age=-1).
  patched_on  : patched with pruning enabled. ("300s" for 1, "1s" for 2, "0" for 3)

bench:
  1: corresponds to catcachebench(1); fetching STATRELATTINH 3000 * 1000 times, generating new cache entries. (Massive cache creation.) Pruning doesn't happen while running this.
  2: catcachebench(2); 60000 rounds of cache access on 1000 STATRELATTINH entries. (Frequent cache reference.) Pruning doesn't happen while running this.
  3: catcachebench(3); fetching 1000(tbls) * 3000(cols) STATRELATTINH entries. The catcache clock advances every 100(tbls) * 3000(cols) accesses, and pruning happens. While running catcachebench(3) once, pruning happens 28 times; most of the time 202202 entries are removed, and the total number of entries is limited to 524289. (The system table has 3000 * 1001 = 3003000 tuples.)

iter: Number of iterations. Time ms and stddev are calculated over the iterations.

binary       | bench | iter  | time ms  | stddev
-------------+-------+-------+----------+--------
master       | 1     | 10    | 8150.30  | 12.96
master       | 2     | 10    | 4002.88  | 16.18
master       | 3     | 10    | 9065.06  | 11.46
-------------+-------+-------+----------+--------
patched_off  | 1     | 10    | 8090.95  | 9.95
patched_off  | 2     | 10    | 3984.67  | 12.33
patched_off  | 3     | 10    | 9050.46  | 4.64
-------------+-------+-------+----------+--------
patched_on   | 1     | 10    | 8158.95  | 6.29
patched_on   | 2     | 10    | 4023.72  | 10.41
patched_on   | 3     | 10    | 16532.66 | 18.39

patched_off is slightly faster than master. patched_on is generally a bit slower. Even though patched_on/3 seems to take too long, the extra time comes from increased catalog table access in exchange for memory saving. (That is, it is expected behavior.) I ran it several times and most of them showed the same tendency.
As a side effect, once the branch is added, the shared syscache in a neighbouring thread will be able to be inserted together without impact on the existing code path.

===
The benchmark script is used as follows:
- Create many (3000, as an example) tables in a "test" schema. I created a partitioned table with 3000 children.
- The tables have many columns, 1000 for me.
- Run the following commands.
  =# select catcachebench(0); -- warm up system tables.
  =# set catalog_cache_prune_min_age = any; -- as required
  =# select catcachebench(n); -- 3 >= n >= 1, the number of "bench" above.

The above result was taken with the following query.

=# select 'patched_on', '3', count(a), avg(a)::numeric(10,2), stddev(a)::numeric(10,2) from (select catcachebench(3) from generate_series(1, 10)) as a(a);

====
The attached patches are:

0001-Adjust-inlining-of-some-functions.patch:
  Changes the inlining property of two functions, SearchCatCacheInternal and CatalogCacheComputeHashValue.

0002-Benchmark-extension-and-required-core-change.patch:
  Micro-benchmark of SearchSysCache3() and core-side tweaks, which are outside this patch set from the viewpoint of functionality. Works for 0001 but not for 0004 or later; 0003 adjusts that.

0003-Adjust-catcachebench-for-later-patches.patch:
  Adjustment of 0002, benchmark for 0004, the body of this patch set. Breaks code consistency until 0004 is applied.

0004-Catcache-pruning-feature.patch:
  The feature patch. It intentionally leaves the indentation of an existing code block in SearchCatCacheInternal unchanged to keep the patch smaller; it is adjusted in the next patch.

0005-Adjust-indentation-of-SearchCatCacheInternal.patch:
  Adjusts the indentation of 0004.

0001+4+5 is the final shape of the patch set; 0002+3 is only for benchmarking.

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center From 927ce9035e13240378c7c332610bdde9377c2d7b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 28 Jun 2019 16:29:52 +0900 Subject: [PATCH 1/5] Adjust inlining of some functions SearchCatCacheInternal code path is quite short and hot so that it doesn't accept additional cycles in the function. But changing inline attribute of SearchCatCacheInternal and CatalogCacheComputeHashValue makes SearchCatCacheN faster by about 6%. This makes room for an extra branch to be the door to other implementations of catcache. --- src/backend/utils/cache/catcache.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 00def27881..8fc067ce31 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -63,10 +63,10 @@ /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; -static inline HeapTuple SearchCatCacheInternal(CatCache *cache, - int nkeys, - Datum v1, Datum v2, - Datum v3, Datum v4); +static HeapTuple SearchCatCacheInternal(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, int nkeys, @@ -75,8 +75,9 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); -static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, - Datum v1, Datum v2, Datum v3, Datum v4); +static inline uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, HeapTuple tuple); static inline bool CatalogCacheCompareTuple(const CatCache *cache, int nkeys, @@ -266,7 +267,7 @@ GetCCHashEqFuncs(Oid keytype, CCHashFN *hashfunc, RegProcedure *eqfunc, CCFastEq 
* * Compute the hash value associated with a given set of lookup keys */ -static uint32 +static inline uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4) { @@ -1194,7 +1195,7 @@ SearchCatCache4(CatCache *cache, /* * Work-horse for SearchCatCache/SearchCatCacheN. */ -static inline HeapTuple +static HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, -- 2.16.3 From f0f5833ddfd0aac934cc7b5ded93541810c486d3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Fri, 28 Jun 2019 17:03:07 +0900 Subject: [PATCH 2/5] Benchmark extension and required core change Micro benchmark extension for SearchSysCache and required core-side code. --- contrib/catcachebench/Makefile | 17 ++ contrib/catcachebench/catcachebench--0.0.sql | 9 + contrib/catcachebench/catcachebench.c | 281 +++++++++++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 13 ++ 5 files changed, 326 insertions(+) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../.. 
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..e091baaaa7 --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,9 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..0cebbbde4f --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,281 @@ +/* + * catcachebench: test code for cache pruning feature + */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + extern bool _catcache_shrink_buckets; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. 
*/ + + _catcache_shrink_buckets = true; + CatalogCacheFlushCatalog(StatisticRelationId); + _catcache_shrink_buckets = false; + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entires of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of a table 6000 times. + */ +double +catcachebench2(void) +{ + const int clock_step = 100; + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 60000 ; t++) + { + int ct = clock_step; + + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 100 times of table scans. + */ + if (--ct < 0) + { + // We don't have it yet. + //SetCatCacheClock(GetCurrentTimestamp()); + GetCurrentTimestamp(); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of all tables twice with having expiration + * happen. + */ +double +catcachebench3(void) +{ + const int clock_step = 100; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 2 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 100 tables scan. + */ + if (--ct < 0) + { + // We don't have it yet. + //SetCatCacheClock(GetCurrentTimestamp()); + GetCurrentTimestamp(); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 8fc067ce31..98427b67cd 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -716,6 +716,9 @@ ResetCatalogCaches(void) * rather than relying on the relcache to keep a tupdesc for us. Of course * this assumes the tupdesc of a cachable system table will not change...) 
*/ +/* CODE FOR catcachebench: REMOVE ME AFTER USE */ +bool _catcache_shrink_buckets = false; +/* END: CODE FOR catcachebench*/ void CatalogCacheFlushCatalog(Oid catId) { @@ -735,6 +738,16 @@ CatalogCacheFlushCatalog(Oid catId) /* Tell inval.c to call syscache callbacks for this cache */ CallSyscacheCallbacks(cache->id, 0); + + /* CODE FOR catcachebench: REMOVE ME AFTER USE */ + if (_catcache_shrink_buckets) + { + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + elog(LOG, "Catcache reset"); + } + /* END: CODE FOR catcachebench*/ } } -- 2.16.3 From fcd0273933f3f53979749ba2e315043f1a6f6f31 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 1 Jul 2019 15:08:11 +0900 Subject: [PATCH 3/5] Adjust catcachebench for later patches Make the benchmark use SetCatCacheClock, which is being introduced by the next patch. This temprarily breaks consistency until the next patch is applied. --- contrib/catcachebench/catcachebench.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c index 0cebbbde4f..63a7400463 100644 --- a/contrib/catcachebench/catcachebench.c +++ b/contrib/catcachebench/catcachebench.c @@ -116,9 +116,7 @@ catcachebench2(void) */ if (--ct < 0) { - // We don't have it yet. - //SetCatCacheClock(GetCurrentTimestamp()); - GetCurrentTimestamp(); + SetCatCacheClock(GetCurrentTimestamp()); ct = clock_step; } for (a = 0 ; a < natts ; a++) @@ -168,9 +166,7 @@ catcachebench3(void) */ if (--ct < 0) { - // We don't have it yet. 
- //SetCatCacheClock(GetCurrentTimestamp()); - GetCurrentTimestamp(); + SetCatCacheClock(GetCurrentTimestamp()); ct = clock_step; } for (a = 0 ; a < natts ; a++) -- 2.16.3 From 57129a5a001b7729f8888579bc11e47cf7192801 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 1 Jul 2019 11:31:54 +0900 Subject: [PATCH 4/5] Catcache pruning feature. Currently we don't have a mechanism to limit the memory amount for syscache. Syscache bloat often causes process die by OOM killer or other problems. This patch lets old syscache entries removed to eventually limit the amount of cache. This patch intentionally unchanges indentation of an existing code block in SearchCatCacheInternal for the patch size to be smaller. It is adjusted in the next patch. --- src/backend/utils/cache/catcache.c | 186 +++++++++++++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/include/utils/catcache.h | 17 ++++ 3 files changed, 215 insertions(+) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 98427b67cd..b552ae960c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -60,9 +60,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -864,9 +873,107 @@ InitCatCache(int id, */ MemoryContextSwitchTo(oldcxt); + /* initialize catcache reference clock if haven't done yet */ + if (catcacheclock == 0) + catcacheclock = GetCurrentTimestamp(); + + /* + * This cache doesn't contain a tuple older than the current time. 
Prevent + * the first pruning from happening too early. + */ + cp->cc_oldest_ts = catcacheclock; + return cp; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left unused for a long time for several reasons. + * Remove such entries to prevent the catcache from bloating. This is based + * on an algorithm similar to buffer eviction: entries that are accessed + * several times in a certain period live longer than those that have had + * less access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age < 0) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a + * per-statement basis and additionally updated periodically + * during a long-running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age > catalog_cache_prune_min_age) + { + /* + * Entries that have not been accessed since the last + * pruning are removed after that many seconds, and their + * lives are prolonged according to how many times they + * are accessed, up to three times that duration. We don't + * try to shrink the buckets since pruning effectively + * caps catcache expansion in the long term.
+ */ + if (ct->naccess > 2) + ct->naccess = 1; + else if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't update oldest_ts for a removed entry */ + continue; + } + } + } + + /* update the oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * Enlarge a catcache, doubling the number of buckets. */ @@ -880,6 +987,10 @@ RehashCatCache(CatCache *cp) elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); + /* try removing old entries before expanding the hash */ + if (CatCacheCleanupOldEntries(cp)) + return; + /* Allocate a new, larger, hash table. */ newnbuckets = cp->cc_nbuckets * 2; newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); @@ -1257,6 +1368,14 @@ SearchCatCacheInternal(CatCache *cache, * dlist within the loop, because we don't continue the loop afterwards. */ bucket = &cache->cc_bucket[hashIndex]; + + /* + * Even though this branch duplicates a fair amount of code, we want as + * few branches as possible here to keep the common path fastest when + * pruning is disabled. Don't move this branch into the foreach to save lines. + */ + if (likely(catalog_cache_prune_min_age < 0)) + { dlist_foreach(iter, bucket) { ct = dlist_container(CatCTup, cache_elem, iter.cur); @@ -1309,6 +1428,71 @@ SearchCatCacheInternal(CatCache *cache, return NULL; } } + } + else + { + /* + * We manage the age of each entry for pruning in this branch.
+ */ + dlist_foreach(iter, bucket) + { + /* The following section is the same as the if() block above */ + ct = dlist_container(CatCTup, cache_elem, iter.cur); + + if (ct->dead) + continue; + + if (ct->hash_value != hashValue) + continue; + + if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments)) + continue; + + dlist_move_head(bucket, &ct->cache_elem); + + /* + * Prolong the life of this entry. Since we want to execute as + * few instructions as possible and want the branch to be + * predictable for performance reasons, we don't put a strict + * cap on the counter. All values above 1 are regarded as 2 in + * CatCacheCleanupOldEntries(). + */ + ct->naccess++; + if (unlikely(ct->naccess == 0)) + ct->naccess = 2; + ct->lastaccess = catcacheclock; + + /* The following part is also the same as the if() block above */ + if (!ct->negative) + { + ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); + ct->refcount++; + ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, + &ct->tuple); + + CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_hits++; +#endif + + + return &ct->tuple; + } + else + { + CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_neg_hits++; +#endif + + return NULL; + } + } + } return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4); } @@ -1902,6 +2086,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 92c4fee8f8..c2a4caa44b 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,7 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include
"utils/bytea.h" +#include "utils/catcache.h" #include "utils/guc_tables.h" #include "utils/float.h" #include "utils/memutils.h" @@ -2252,6 +2253,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, NULL, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ff1fabaca1..ad962fb096 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + TimestampTz cc_oldest_ts; /* timestamp of the oldest tuple in the hash */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + unsigned int naccess; /* # of accesses to this entry */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,19 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for the access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set the catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.16.3 From 772d7fda030e8990fbd84ec44f07120d73682256 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp> Date: Mon, 1 Jul 2019 14:11:08 +0900 Subject: [PATCH 5/5] Adjust indentation of SearchCatCacheInternal The previous patch left the indentation of an existing code block in SearchCatCacheInternal unchanged to keep the diff small. This adjusts the indentation. --- src/backend/utils/cache/catcache.c | 82 +++++++++++++++++++------------------- 1 file changed, 41 insertions(+), 41 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index b552ae960c..60d0fd28a8 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -1376,59 +1376,59 @@ SearchCatCacheInternal(CatCache *cache, */ if (likely(catalog_cache_prune_min_age < 0)) { - dlist_foreach(iter, bucket) - { - ct = dlist_container(CatCTup, cache_elem, iter.cur); - - if (ct->dead) - continue; /* ignore dead entries */ - - if (ct->hash_value != hashValue) - continue; /* quickly skip entry if wrong hash val */ - - if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments)) - continue; - - /* - * We found a match in the cache. Move it to the front of the list - * for its hashbucket, in order to speed subsequent searches. (The - * most frequently accessed elements in any hashbucket will tend to be - * near the front of the hashbucket's list.)
- */ - dlist_move_head(bucket, &ct->cache_elem); - - /* - * If it's a positive entry, bump its refcount and return it. If it's - * negative, we can report failure to the caller. - */ - if (!ct->negative) + dlist_foreach(iter, bucket) { - ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); - ct->refcount++; - ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple); + ct = dlist_container(CatCTup, cache_elem, iter.cur); - CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", - cache->cc_relname, hashIndex); + if (ct->dead) + continue; /* ignore dead entries */ + + if (ct->hash_value != hashValue) + continue; /* quickly skip entry if wrong hash val */ + + if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments)) + continue; + + /* + * We found a match in the cache. Move it to the front of the + * list for its hashbucket, in order to speed subsequent searches. + * (The most frequently accessed elements in any hashbucket will + * tend to be near the front of the hashbucket's list.) + */ + dlist_move_head(bucket, &ct->cache_elem); + + /* + * If it's a positive entry, bump its refcount and return it. If + * it's negative, we can report failure to the caller. + */ + if (!ct->negative) + { + ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); + ct->refcount++; + ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple); + + CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", + cache->cc_relname, hashIndex); #ifdef CATCACHE_STATS - cache->cc_hits++; + cache->cc_hits++; #endif - return &ct->tuple; - } - else - { - CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", - cache->cc_relname, hashIndex); + return &ct->tuple; + } + else + { + CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", + cache->cc_relname, hashIndex); #ifdef CATCACHE_STATS - cache->cc_neg_hits++; + cache->cc_neg_hits++; #endif - return NULL; + return NULL; + } } } - } else { /* -- 2.16.3
I'd like to throw in some food for discussion on how much SearchSysCacheN degrades depending on how we insert code into the SearchSysCacheN code path.

I ran the attached run2.sh script, which runs catcachebench2(); it asks SearchSysCache3() for cached entries (almost) 240000 times per run. Each output line shows the mean of 3 runs and the stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl | psql" creates a database for the benchmark. catcachebench2() runs the shortest of the three paths in the attached benchmark program.

(pg_ctl start)
$ perl gen_tbl.pl | psql
...
(pg_ctl stop)

0. Baseline (0001-benchmark.patch, 0002-Base-change.patch)

At first, I made two binaries from the literally identical source. For the benchmark's sake the source is already modified a bit; specifically, it has SetCatCacheClock, which is needed by the benchmark but not actually called in this benchmark.

              time(ms) | stddev(ms)
not patched |  7750.42 |  23.83   # 0.6% faster than 7775.23
not patched |  7864.73 |  43.21
not patched |  7866.80 | 106.47
not patched |  7952.06 |  63.14
master      |  7775.23 |  35.76
master      |  7870.42 | 120.31
master      |  7876.76 | 109.04
master      |  7963.04 |   9.49

So it seems to me that we cannot say anything meaningful about differences below about 80 ms (about 1%).

1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch)

This is the most straightforward way to add an alternative feature.

pattern 1   |  8459.73 |  28.15   # 9% (>> 1%) slower than 7757.58
pattern 1   |  8504.83 |  55.61
pattern 1   |  8541.81 |  41.56
pattern 1   |  8552.20 |  27.99
master      |  7757.58 |  22.65
master      |  7801.32 |  20.64
master      |  7839.57 |  25.28
master      |  7925.30 |  38.84

It's so slow that it cannot be used.

2. Making SearchCatCacheInternal an indirect function. (CatCache_Pattern_2.patch)

Next, I made the workhorse routine be called indirectly.
The "inline" on the function actually lets the compiler optimize the SearchCatCacheN routines as described in the comment, but the effect doesn't seem so large, at least in this case.

pattern 2   |  7976.22 |  46.12   # 2.6% (> 1%) slower
pattern 2   |  8103.03 |  51.57
pattern 2   |  8144.97 |  68.46
pattern 2   |  8353.10 |  34.89
master      |  7768.40 |  56.00
master      |  7772.02 |  29.05
master      |  7775.05 |  27.69
master      |  7830.82 |  13.78

3. Making the SearchCatCacheN functions indirect. (CatCache_Pattern_3.patch)

As far as gcc/linux/x86 goes, SearchSysCacheN is compiled into the following instructions:

   0x0000000000866c20 <+0>:  movslq %edi,%rdi
   0x0000000000866c23 <+3>:  mov    0xd3da40(,%rdi,8),%rdi
   0x0000000000866c2b <+11>: jmpq   0x856ee0 <SearchCatCache3>

If we make the SearchCatCacheN functions indirect, as the patch does, just one instruction changes:

   0x0000000000866c50 <+0>:  movslq %edi,%rdi
   0x0000000000866c53 <+3>:  mov    0xd3da60(,%rdi,8),%rdi
   0x0000000000866c5b <+11>: jmpq   *0x4c0caf(%rip)        # 0xd27910 <SearchCatCache3>

pattern 3   |  7836.26 |  48.66   # 2% (> 1%) slower
pattern 3   |  7963.74 |  67.88
pattern 3   |  7966.65 | 101.07
pattern 3   |  8214.57 |  71.93
master      |  7679.74 |  62.20
master      |  7756.14 |  77.19
master      |  7867.14 |  73.33
master      |  7893.97 |  47.67

I expected this to run in almost the same time. I'm not sure whether the difference is the result of the spectre_v2 mitigation, but here is the status of my environment:

# uname -r
4.18.0-80.11.2.el8_0.x86_64
# cat /proc/cpuinfo
...
model name : Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
stepping   : 12
microcode  : 0xae
bugs       : spectre_v1 spectre_v2 spec_store_bypass mds
# cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: disabled, RSB filling

I am using CentOS 8 and I haven't found a handy (or on-the-fly) way to disable the mitigations.
Attached are: 0001-benchmark.patch : catcache benchmark extension (and core side fix) 0002-Base-change.patch : baseline change for this benchmark series CatCache_Pattern_1.patch: naive branching CatCache_Pattern_2.patch: indirect SearchCatCacheInternal CatCache_Pattern_3.patch: indirect SearchCatCacheN regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 245e88e1b43df74273fbaa1b22f4f64621ffe9d5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 14 Nov 2019 19:24:36 +0900 Subject: [PATCH 1/2] benchmark --- contrib/catcachebench/Makefile | 17 + contrib/catcachebench/catcachebench--0.0.sql | 14 + contrib/catcachebench/catcachebench.c | 330 +++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 33 ++ src/backend/utils/cache/syscache.c | 2 +- 6 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../..
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..ea9cd62abb --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,14 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..b5a4d794ed --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,330 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + 
+Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. 
*/ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entries of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of one table 240000 times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of all tables several times while letting + * expiration + * happen.
+ */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 1000 tables scan. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index c3e7d94aa5..2dd8455052 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -740,6 +740,39 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + elog(LOG, "Catcache reset"); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index d69c0ff813..2e282a10b4 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false; -- 2.23.0 From eebffb678b2450fbf51395de8c52f4b53a9286d1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 14 Nov 2019 20:28:29 +0900 Subject: [PATCH 2/2] Base change. --- src/backend/utils/cache/catcache.c | 19 ++++++++++++++++++- src/backend/utils/misc/guc.c | 13 +++++++++++++ src/include/utils/catcache.h | 17 +++++++++++++++++ 3 files changed, 48 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 2dd8455052..2dbc2151b1 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -60,9 +60,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} /* * internal support functions @@ -765,7 +780,9 @@ CatalogCacheFlushCatalog2(Oid catId) cache->cc_nbuckets = 128; pfree(cache->cc_bucket); cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); - elog(LOG, "Catcache reset"); + ereport(DEBUG1, + (errmsg("Catcache reset"), + errhidestmt(true))); } } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 4b3769b8b0..39a18a8c7a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -82,6 +82,8 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" +#include "utils/guc_tables.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -2257,6 +2259,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ff1fabaca1..8105f19bc4 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -189,6 +190,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for the access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set the catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.23.0 diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 2dbc2151b1..81ccc0b472 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -1273,6 +1273,12 @@ SearchCatCacheInternal(CatCache *cache, #ifdef CATCACHE_STATS cache->cc_searches++; #endif + /* cannot be true, but compiler doesn't know */ + if (catalog_cache_prune_min_age < -1) + { + return SearchCatCache(cache, v1, v2, v3, v4); /* Never executed */ + } + /* Initialize local parameter array */ arguments[0] = v1; diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 2dbc2151b1..48a8a14c7f 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -72,11 +72,16 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock for the last accessed time of a catcache entry.
*/ TimestampTz catcacheclock = 0; -static inline HeapTuple SearchCatCacheInternal(CatCache *cache, +static HeapTuple SearchCatCacheInternalb(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple (*SearchCatCacheInternal)(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4) = + SearchCatCacheInternalb; static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, int nkeys, uint32 hashValue, @@ -1245,7 +1250,7 @@ SearchCatCache4(CatCache *cache, * Work-horse for SearchCatCache/SearchCatCacheN. */ static inline HeapTuple -SearchCatCacheInternal(CatCache *cache, +SearchCatCacheInternalb(CatCache *cache, int nkeys, Datum v1, Datum v2, diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 2dbc2151b1..e4ebd07397 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -84,6 +84,26 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCacheb(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +HeapTuple (*SearchCatCache)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4) = + SearchCatCacheb; +static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1); +HeapTuple (*SearchCatCache1)(CatCache *cache, Datum v1) = SearchCatCache1b; +static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2); +HeapTuple (*SearchCatCache2)(CatCache *cache, Datum v1, Datum v2) = + SearchCatCache2b; +static HeapTuple SearchCatCache3b(CatCache *cache, + Datum v1, Datum v2, Datum v3); +HeapTuple (*SearchCatCache3)(CatCache *cache, Datum v1, Datum v2, Datum v3) = + SearchCatCache3b; +static HeapTuple SearchCatCache4b(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +HeapTuple (*SearchCatCache4)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4) = + SearchCatCache4b; + static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum 
v1, Datum v2, Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, @@ -1193,8 +1213,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey) * the caller need not go to the trouble of converting it to a fully * null-padded NAME. */ -HeapTuple -SearchCatCache(CatCache *cache, +static HeapTuple +SearchCatCacheb(CatCache *cache, Datum v1, Datum v2, Datum v3, @@ -1210,32 +1230,32 @@ SearchCatCache(CatCache *cache, * bit faster than SearchCatCache(). */ -HeapTuple -SearchCatCache1(CatCache *cache, +static HeapTuple +SearchCatCache1b(CatCache *cache, Datum v1) { return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0); } -HeapTuple -SearchCatCache2(CatCache *cache, +static HeapTuple +SearchCatCache2b(CatCache *cache, Datum v1, Datum v2) { return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0); } -HeapTuple -SearchCatCache3(CatCache *cache, +static HeapTuple +SearchCatCache3b(CatCache *cache, Datum v1, Datum v2, Datum v3) { return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0); } -HeapTuple -SearchCatCache4(CatCache *cache, +static HeapTuple +SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4) { return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 8105f19bc4..f2e0d29bc8 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -213,15 +213,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, int nbuckets); extern void InitCatCachePhase2(CatCache *cache, bool touch_index); -extern HeapTuple SearchCatCache(CatCache *cache, +extern HeapTuple (*SearchCatCache)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); -extern HeapTuple SearchCatCache1(CatCache *cache, +extern HeapTuple (*SearchCatCache1)(CatCache *cache, Datum v1); -extern HeapTuple SearchCatCache2(CatCache *cache, +extern HeapTuple (*SearchCatCache2)(CatCache *cache, Datum v1, Datum v2); -extern HeapTuple SearchCatCache3(CatCache 
*cache, +extern HeapTuple (*SearchCatCache3)(CatCache *cache, Datum v1, Datum v2, Datum v3); -extern HeapTuple SearchCatCache4(CatCache *cache, +extern HeapTuple (*SearchCatCache4)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); extern void ReleaseCatCache(HeapTuple tuple); #! /usr/bin/perl $collist = ""; foreach $i (0..1000) { $collist .= sprintf(", c%05d int", $i); } $collist = substr($collist, 2); printf "drop schema if exists test cascade;\n"; printf "create schema test;\n"; foreach $i (0..2999) { printf "create table test.t%04d ($collist);\n", $i; } #!/bin/bash LOOPS=3 USES=1 BINROOT=/home/horiguti/bin DATADIR=/home/horiguti/data/data_catexp PREC="numeric(10,2)" /usr/bin/killall postgres /usr/bin/sleep 3 run() { local BINARY=$1 local PGCTL=$2/bin/pg_ctl local PGSQL=$2/bin/postgres local PSQL=$2/bin/psql if [ "$3" != "" ]; then local SETTING1="set catalog_cache_prune_min_age to \"$3\";" local SETTING2="set catalog_cache_prune_min_age to \"$4\";" local SETTING3="set catalog_cache_prune_min_age to \"$5\";" fi # ($PGSQL -D $DATADIR 2>&1 > /dev/null)& ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')& /usr/bin/sleep 3 ${PSQL} postgres <<EOF create extension if not exists catcachebench; select catcachebench(0); $SETTING3 select * from generate_series(2, 2) test, LATERAL (select '${BINARY}' as version, '${USES}/' || (count(r) OVER())::text as n, r::${PREC}, (stddev(r) OVER ())::${PREC} from (select catcachebench(test) as r from generate_series(1, ${LOOPS})) r order by r limit ${USES}) r EOF $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /' # oreport > $BINARY_perf.txt } for i in $(seq 0 3); do run "E_off" $BINROOT/pgsql_catexpe "-1" "-1" "-1" #run "E_on" $BINROOT/pgsql_catexpe "300s" "1s" "0" run "master" $BINROOT/pgsql_master_o2 "" "" "" done
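The hunks above replace the SearchCatCacheN entry points with plain function pointers so the implementation can be switched without a per-call flag test. As a standalone sketch of that dispatch technique (hypothetical names, not the patch's actual code, and using a trivial lookup body in place of the real hash search):

```c
/* Two interchangeable implementations of one lookup operation. */
int
lookup_plain(int key)
{
    return key;                 /* plain hash search stand-in */
}

int
lookup_expire(int key)
{
    return key;                 /* would additionally consider pruning old entries */
}

/*
 * Callers always go through this pointer, so the implementation can be
 * swapped once (e.g. from a GUC assign hook) instead of testing a flag
 * on every search; the indirection costs one extra load per call.
 */
int (*lookup)(int key) = lookup_plain;

void
set_lookup_mode(int prune_min_age)
{
    lookup = (prune_min_age < 0) ? lookup_plain : lookup_expire;
}
```

The benchmarked question is exactly whether that extra load is measurable on the hot catcache path.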
On Tue, Nov 19, 2019 at 07:48:10PM +0900, Kyotaro Horiguchi wrote: > I'd like to throw in food for discussion on how much SearchSysCacheN > suffers degradation from some choices on how we can insert a code into > the SearchSysCacheN code path. Please note that the patch has a warning, causing cfbot-san to complain: catcache.c:786:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] CatalogCacheFlushCatalog2(Oid catId) ^ cc1: all warnings being treated as errors So this should at least be fixed. For now I have moved it to next CF, waiting on author. -- Michael
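For reference, -Wmissing-prototypes fires here because CatalogCacheFlushCatalog2() is defined with external linkage but has no prior declaration; the usual fix is to add the prototype to a header (or mark the function static). A minimal illustration with a hypothetical stub name, not the actual patch code:

```c
/*
 * Without a prior declaration, defining a function with external linkage
 * triggers -Wmissing-prototypes.  Declaring the prototype first (normally
 * in a header shared with the callers, shown inline here for brevity)
 * silences it; making the function static would too, but the benchmark
 * extension calls the real function from another module.
 */
unsigned int CatalogCacheFlushCatalog2_stub(unsigned int cat_id);

unsigned int
CatalogCacheFlushCatalog2_stub(unsigned int cat_id)
{
    /* placeholder body standing in for the real cache reset */
    return cat_id;
}
```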
Attachment
This is a new complete workable patch after a long time of struggling with benchmarking. At Tue, 19 Nov 2019 19:48:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > I ran the run2.sh script attached, which runs catcachebench2(), which > asks SearchSysCache3() for cached entries (almost) 240000 times per > run. The number of each output line is the mean of 3 times runs, and > stddev. Lines are in "time" order and edited to fit here. "gen_tbl.pl > | psql" creates a database for the benchmark. catcachebench2() runs > the shortest path in the three in the attached benchmark program. > > (pg_ctl start) > $ perl gen_tbl.pl | psql ... > (pg_ctl stop) I wonder why I took the average of the times instead of choosing the fastest one. This benchmark is extremely CPU intensive, so the fastest run reliably represents the performance. I changed the benchmark so that it shows the time of the fastest run (run4.sh). Based on the latest result, I used pattern 3 (SearchSysCacheN indirection, wrongly labeled as 1 in the last mail) in the latest version. I took the fastest time among 3 iterations of 5 runs of both master/patched O2 binaries. version | min ---------+--------- master | 7986.65 patched | 7984.47 = 'indirect' below I would say this version doesn't get degraded by indirect calls. So, I applied the other part of the catcache expiration patch as the succeeding parts. After that I got a somewhat strange but very stable result: just adding struct members accelerates the benchmark. The numbers are the fastest time of 20 runs of the benchmark in 10 iterations. ms master 7980.79 # the master with the benchmark extension (0001) ===== base 7340.96 # add only struct members and a GUC variable.
(0002) indirect 7998.68 # call SearchCatCacheN indirectly (0003) ===== expire-off 7422.30 # CatCache expiration (0004) # (catalog_cache_prune_min_age = -1) expire-on 7861.13 # CatCache expiration (catalog_cache_prune_min_age = 0) The patch accelerates CatCacheSearch for uncertain reasons. I'm not sure what makes the difference between about 8000 ms and about 7400 ms, though. Building all versions several times and rerunning the benchmark gave results with the same tendency. I'll stop this work here for now and continue later. The following files are attached. 0001-catcache-benchmark-extension.patch: benchmark extension used for the benchmarking here. The test tables are generated using gentbl2.pl attached. (perl gentbl2.pl | psql) 0002-base_change.patch: Preliminarily adds some struct members and a GUC variable to see if they cause any degradation. 0003-Make-CatCacheSearchN-indirect-functions.patch: Rewritten so that the CatCacheSearchN functions are called indirectly. 0004-CatCache-expiration-feature.patch: Add the CatCache expiration feature. gentbl2.pl: A script that emits SQL statements to generate test tables. run4.sh : The test script I used for benchmarking here. build2.sh : A script I used to build the four types of binaries used here. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From dacf4a2ac9eb49099e744ee24066b94e9f78aa61 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 14 Nov 2019 19:24:36 +0900 Subject: [PATCH 1/4] catcache benchmark extension Provides the function catcachebench(bench_no int), which runs a CPU-intensive benchmark on catcache search. The test tables are created by a separately provided script. catcachebench(0): prewarm the catcache with the provided test tables. catcachebench(1): fetches all attribute stats of all tables. This benchmark loads a vast number of unique entries. Expiration doesn't work since it runs in a transaction.
catcachebench(2): fetches all attribute stats of a table many times. This benchmark repeatedly accesses already loaded entries. Expiration doesn't work since it runs in a transaction. catcachebench(3): fetches all attribute stats of all tables four times. Different from the other modes, this exercises expiration by forcibly updating the reference clock variable every 1000 entries. At this point, the variables needed for the expiration feature are not added yet, so SetCatCacheClock is a dummy macro that just replaces it with its parameter. --- contrib/catcachebench/Makefile | 17 + contrib/catcachebench/catcachebench--0.0.sql | 14 + contrib/catcachebench/catcachebench.c | 330 +++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 35 ++ src/backend/utils/cache/syscache.c | 2 +- src/include/utils/catcache.h | 3 + 7 files changed, 406 insertions(+), 1 deletion(-) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../..
+include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..ea9cd62abb --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,14 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..b6c2b8f577 --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,330 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + 
+Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. 
*/ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entries of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of a table many times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of all tables several times while making + * expiration happen.
+ */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I chose + * to update the clock every 1000 table scans. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but..
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relations found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many attributes found in table %d", tableoids[0]); + + /* collect target attributes.
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 64776e3209..95a4e30d2b 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -740,6 +740,41 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + ereport(DEBUG1, + (errmsg("Catcache reset"), + errhidestmt(true))); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 53d9ddf159..1c79a85a8c 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f4aa316604..ea9e75a1ae 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -228,4 +228,7 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* tentative change to allow benchmark on master branch */ +#define SetCatCacheClock(ts) (ts) + #endif /* CATCACHE_H */ -- 2.23.0 From a18c8f531c685682b22d304efa8bfb31401cc3b0 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 10 Jan 2020 15:02:26 +0900 Subject: [PATCH 2/4] base_change Adds the struct members needed by the catcache expiration feature and a GUC variable that controls the behavior of the feature, but no substantial code is added yet. This also replaces SetCatCacheClock() with the real definition. If the existence of these variables alone causes any degradation, benchmarking after this patch will show it.
--- src/backend/utils/cache/catcache.c | 15 +++++++++++++++ src/backend/utils/misc/guc.c | 13 +++++++++++++ src/include/utils/catcache.h | 23 ++++++++++++++++++++--- 3 files changed, 48 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 95a4e30d2b..d267e5ce6e 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -60,9 +60,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} /* * internal support functions diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 62285792ec..2f2b599f61 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -83,6 +83,8 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" +#include "utils/guc_tables.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -2280,6 +2282,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this seconds are 
considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth. InitializeGUCOptions will increase it if diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ea9e75a1ae..3d3870f05a 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + TimestampTz cc_oldest_ts; /* timestamp of the oldest tuple in the hash */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + unsigned int naccess; /* # of access to this entry */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h...
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, @@ -228,7 +248,4 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); -/* tentative change to allow benchmark on master branch */ -#define SetCatCacheClock(ts) (ts) - #endif /* CATCACHE_H */ -- 2.23.0 From 5327bfd024ba9e5313cca39db4a1e986c299ca16 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 9 Jan 2020 19:22:18 +0900 Subject: [PATCH 3/4] Make CatCacheSearchN indirect functions Experiments showed that the best way to add a new feature to the current CatCacheSearch path is to make the SearchCatCacheN functions replaceable through indirect calls. This patch does that. If the change in how the functions are called causes any degradation on its own, benchmarking after this patch is applied will show it.
--- src/backend/utils/cache/catcache.c | 42 +++++++++++++++++++++++------- src/include/utils/catcache.h | 40 ++++++++++++++++++++++++---- 2 files changed, 67 insertions(+), 15 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index d267e5ce6e..74c893ba4e 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -84,6 +84,15 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCacheb(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1); +static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2); +static HeapTuple SearchCatCache3b(CatCache *cache, + Datum v1, Datum v2, Datum v3); +static HeapTuple SearchCatCache4b(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, @@ -108,6 +117,16 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +static SearchCatCacheFuncsType catcache_base = { + SearchCatCacheb, + SearchCatCache1b, + SearchCatCache2b, + SearchCatCache3b, + SearchCatCache4b +}; + +SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL; + /* GUC assign function */ void assign_catalog_cache_prune_min_age(int newval, void *extra) @@ -852,6 +871,9 @@ InitCatCache(int id, CacheHdr = (CatCacheHeader *) palloc(sizeof(CatCacheHeader)); slist_init(&CacheHdr->ch_caches); CacheHdr->ch_ntup = 0; + + SearchCatCacheFuncs = &catcache_base; + #ifdef CATCACHE_STATS /* set up to dump stats at backend exit */ on_proc_exit(CatCachePrintStats, 0); @@ -1193,8 +1215,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey) * the 
caller need not go to the trouble of converting it to a fully * null-padded NAME. */ -HeapTuple -SearchCatCache(CatCache *cache, +static HeapTuple +SearchCatCacheb(CatCache *cache, Datum v1, Datum v2, Datum v3, @@ -1210,32 +1232,32 @@ SearchCatCache(CatCache *cache, * bit faster than SearchCatCache(). */ -HeapTuple -SearchCatCache1(CatCache *cache, +static HeapTuple +SearchCatCache1b(CatCache *cache, Datum v1) { return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0); } -HeapTuple -SearchCatCache2(CatCache *cache, +static HeapTuple +SearchCatCache2b(CatCache *cache, Datum v1, Datum v2) { return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0); } -HeapTuple -SearchCatCache3(CatCache *cache, +static HeapTuple +SearchCatCache3b(CatCache *cache, Datum v1, Datum v2, Datum v3) { return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0); } -HeapTuple -SearchCatCache4(CatCache *cache, +static HeapTuple +SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4) { return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 3d3870f05a..f9e9889339 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -189,6 +189,36 @@ typedef struct catcacheheader int ch_ntup; /* # of tuples in all caches */ } CatCacheHeader; +typedef HeapTuple (*SearchCatCache_fn)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +typedef HeapTuple (*SearchCatCache1_fn)(CatCache *cache, Datum v1); +typedef HeapTuple (*SearchCatCache2_fn)(CatCache *cache, Datum v1, Datum v2); +typedef HeapTuple (*SearchCatCache3_fn)(CatCache *cache, Datum v1, Datum v2, + Datum v3); +typedef HeapTuple (*SearchCatCache4_fn)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + +typedef struct SearchCatCacheFuncsType +{ + SearchCatCache_fn SearchCatCache; + SearchCatCache1_fn SearchCatCache1; + SearchCatCache2_fn SearchCatCache2; + SearchCatCache3_fn SearchCatCache3; + SearchCatCache4_fn SearchCatCache4; 
+} SearchCatCacheFuncsType; + +extern PGDLLIMPORT SearchCatCacheFuncsType *SearchCatCacheFuncs; + +#define SearchCatCache(cache, v1, v2, v3, v4) \ + SearchCatCacheFuncs->SearchCatCache(cache, v1, v2, v3, v4) +#define SearchCatCache1(cache, v1) \ + SearchCatCacheFuncs->SearchCatCache1(cache, v1) +#define SearchCatCache2(cache, v1, v2) \ + SearchCatCacheFuncs->SearchCatCache2(cache, v1, v2) +#define SearchCatCache3(cache, v1, v2, v3) \ + SearchCatCacheFuncs->SearchCatCache3(cache, v1, v2, v3) +#define SearchCatCache4(cache, v1, v2, v3, v4) \ + SearchCatCacheFuncs->SearchCatCache4(cache, v1, v2, v3, v4) /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; @@ -216,15 +246,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, int nbuckets); extern void InitCatCachePhase2(CatCache *cache, bool touch_index); -extern HeapTuple SearchCatCache(CatCache *cache, +extern HeapTuple (*SearchCatCache)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); -extern HeapTuple SearchCatCache1(CatCache *cache, +extern HeapTuple (*SearchCatCache1)(CatCache *cache, Datum v1); -extern HeapTuple SearchCatCache2(CatCache *cache, +extern HeapTuple (*SearchCatCache2)(CatCache *cache, Datum v1, Datum v2); -extern HeapTuple SearchCatCache3(CatCache *cache, +extern HeapTuple (*SearchCatCache3)(CatCache *cache, Datum v1, Datum v2, Datum v3); -extern HeapTuple SearchCatCache4(CatCache *cache, +extern HeapTuple (*SearchCatCache4)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); extern void ReleaseCatCache(HeapTuple tuple); -- 2.23.0 From bef3df3bb2a0c2340eadf267cdfaf8d40612cd0c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 10 Jan 2020 15:08:54 +0900 Subject: [PATCH 4/4] CatCache expiration feature. This adds the catcache expiration feature to the catcache mechanism. 
Current catcache doesn't remove an entry, and there's a case where many hash entries occupy a large amount of memory while never being accessed again. This is a quite serious issue for long-running sessions. The expiration feature addresses this case, in exchange for some degradation when it is turned on. --- src/backend/utils/cache/catcache.c | 343 +++++++++++++++++++++++++++-- 1 file changed, 326 insertions(+), 17 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 74c893ba4e..35e1a07e57 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -72,10 +72,11 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock for the last accessed time of a catcache entry. */ TimestampTz catcacheclock = 0; -static inline HeapTuple SearchCatCacheInternal(CatCache *cache, - int nkeys, - Datum v1, Datum v2, - Datum v3, Datum v4); +/* basic catcache search functions */ +static inline HeapTuple SearchCatCacheInternalb(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, int nkeys, @@ -93,6 +94,23 @@ static HeapTuple SearchCatCache3b(CatCache *cache, static HeapTuple SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); +/* catcache search functions with expiration feature */ +static inline HeapTuple SearchCatCacheInternale(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); + +static HeapTuple SearchCatCachee(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCache1e(CatCache *cache, Datum v1); +static HeapTuple SearchCatCache2e(CatCache *cache, Datum v1, Datum v2); +static HeapTuple SearchCatCache3e(CatCache *cache, + Datum v1, Datum v2, Datum v3); +static HeapTuple SearchCatCache4e(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + +static bool CatCacheCleanupOldEntries(CatCache *cp); + static uint32
CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, @@ -125,13 +143,35 @@ static SearchCatCacheFuncsType catcache_base = { SearchCatCache4b }; +static SearchCatCacheFuncsType catcache_expire = { + SearchCatCachee, + SearchCatCache1e, + SearchCatCache2e, + SearchCatCache3e, + SearchCatCache4e +}; + SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL; +/* set catcache function set according to guc variables */ +static void +set_catcache_functions(void) +{ + if (catalog_cache_prune_min_age < 0) + SearchCatCacheFuncs = &catcache_base; + else + SearchCatCacheFuncs = &catcache_expire; +} + + /* GUC assign function */ void assign_catalog_cache_prune_min_age(int newval, void *extra) { catalog_cache_prune_min_age = newval; + + /* choose corresponding function set */ + set_catcache_functions(); } /* @@ -872,7 +912,7 @@ InitCatCache(int id, slist_init(&CacheHdr->ch_caches); CacheHdr->ch_ntup = 0; - SearchCatCacheFuncs = &catcache_base; + set_catcache_functions(); #ifdef CATCACHE_STATS /* set up to dump stats at backend exit */ @@ -938,6 +978,10 @@ RehashCatCache(CatCache *cp) elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); + /* try removing old entries before expanding hash */ + if (CatCacheCleanupOldEntries(cp)) + return; + /* Allocate a new, larger, hash table. 
*/ newnbuckets = cp->cc_nbuckets * 2; newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head)); @@ -1222,7 +1266,7 @@ SearchCatCacheb(CatCache *cache, Datum v3, Datum v4) { - return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4); + return SearchCatCacheInternalb(cache, cache->cc_nkeys, v1, v2, v3, v4); } @@ -1236,7 +1280,7 @@ static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1) { - return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0); + return SearchCatCacheInternalb(cache, 1, v1, 0, 0, 0); } @@ -1244,7 +1288,7 @@ static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2) { - return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0); + return SearchCatCacheInternalb(cache, 2, v1, v2, 0, 0); } @@ -1252,7 +1296,7 @@ static HeapTuple SearchCatCache3b(CatCache *cache, Datum v1, Datum v2, Datum v3) { - return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0); + return SearchCatCacheInternalb(cache, 3, v1, v2, v3, 0); } @@ -1260,19 +1304,19 @@ static HeapTuple SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4) { - return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4); + return SearchCatCacheInternalb(cache, 4, v1, v2, v3, v4); } /* - * Work-horse for SearchCatCache/SearchCatCacheN. + * Work-horse for SearchCatCacheb/SearchCatCacheNb. */ static inline HeapTuple -SearchCatCacheInternal(CatCache *cache, - int nkeys, - Datum v1, - Datum v2, - Datum v3, - Datum v4) +SearchCatCacheInternalb(CatCache *cache, + int nkeys, + Datum v1, + Datum v2, + Datum v3, + Datum v4) { Datum arguments[CATCACHE_MAXKEYS]; uint32 hashValue; @@ -1497,6 +1541,269 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * SearchCatCache with entry pruning + * + * These functions works the same way with SearchCatCacheNb() functions except + * that less-used entries are removed following catalog_cache_prune_min_age + * setting. 
+ */ +static HeapTuple +SearchCatCachee(CatCache *cache, + Datum v1, + Datum v2, + Datum v3, + Datum v4) +{ + return SearchCatCacheInternale(cache, cache->cc_nkeys, v1, v2, v3, v4); +} + + +/* + * SearchCatCacheN() are SearchCatCache() versions for a specific number of + * arguments. The compiler can inline the body and unroll loops, making them a + * bit faster than SearchCatCache(). + */ + +static HeapTuple +SearchCatCache1e(CatCache *cache, + Datum v1) +{ + return SearchCatCacheInternale(cache, 1, v1, 0, 0, 0); +} + + +static HeapTuple +SearchCatCache2e(CatCache *cache, + Datum v1, Datum v2) +{ + return SearchCatCacheInternale(cache, 2, v1, v2, 0, 0); +} + + +static HeapTuple +SearchCatCache3e(CatCache *cache, + Datum v1, Datum v2, Datum v3) +{ + return SearchCatCacheInternale(cache, 3, v1, v2, v3, 0); +} + + +static HeapTuple +SearchCatCache4e(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4) +{ + return SearchCatCacheInternale(cache, 4, v1, v2, v3, v4); +} + +/* + * Work-horse for SearchCatCachee/SearchCatCacheNe. 
+ */ +static inline HeapTuple +SearchCatCacheInternale(CatCache *cache, + int nkeys, + Datum v1, + Datum v2, + Datum v3, + Datum v4) +{ + Datum arguments[CATCACHE_MAXKEYS]; + uint32 hashValue; + Index hashIndex; + dlist_iter iter; + dlist_head *bucket; + CatCTup *ct; + + /* Make sure we're in an xact, even if this ends up being a cache hit */ + Assert(IsTransactionState()); + + Assert(cache->cc_nkeys == nkeys); + + /* + * one-time startup overhead for each cache + */ + if (unlikely(cache->cc_tupdesc == NULL)) + CatalogCacheInitializeCache(cache); + +#ifdef CATCACHE_STATS + cache->cc_searches++; +#endif + + /* Initialize local parameter array */ + arguments[0] = v1; + arguments[1] = v2; + arguments[2] = v3; + arguments[3] = v4; + + /* + * find the hash bucket in which to look for the tuple + */ + hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4); + hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets); + + /* + * scan the hash bucket until we find a match or exhaust our tuples + * + * Note: it's okay to use dlist_foreach here, even though we modify the + * dlist within the loop, because we don't continue the loop afterwards. + */ + bucket = &cache->cc_bucket[hashIndex]; + dlist_foreach(iter, bucket) + { + ct = dlist_container(CatCTup, cache_elem, iter.cur); + + if (ct->dead) + continue; /* ignore dead entries */ + + if (ct->hash_value != hashValue) + continue; /* quickly skip entry if wrong hash val */ + + if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments)) + continue; + + /* + * We found a match in the cache. Move it to the front of the list + * for its hashbucket, in order to speed subsequent searches. (The + * most frequently accessed elements in any hashbucket will tend to be + * near the front of the hashbucket's list.) + */ + dlist_move_head(bucket, &ct->cache_elem); + + /* + * Prolong life of this entry. 
Since we want to run as few instructions + * as possible and want the branch to be stable for performance reasons, + * we don't give a strict cap on the counter. All numbers above 1 will + * be regarded as 2 in CatCacheCleanupOldEntries(). + */ + ct->naccess++; + if (unlikely(ct->naccess == 0)) + ct->naccess = 2; + ct->lastaccess = catcacheclock; + + /* + * If it's a positive entry, bump its refcount and return it. If it's + * negative, we can report failure to the caller. + */ + if (!ct->negative) + { + ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); + ct->refcount++; + ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple); + + CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_hits++; +#endif + + return &ct->tuple; + } + else + { + CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_neg_hits++; +#endif + + return NULL; + } + } + + return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4); +} + +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries can be left unused for a long time for several + * reasons. Remove such entries to prevent the catcache from bloating. It is + * based on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration.
+ */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age < 0) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a + * per-statement basis and additionally updated periodically + * during a long-running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age > catalog_cache_prune_min_age) + { + /* + * Entries that have not been accessed since the last + * pruning round are removed after that many seconds, and + * their lives are prolonged according to how many times + * they have been accessed, up to three times the + * duration. We don't try to shrink the buckets since + * pruning effectively caps catcache expansion in the long + * term.
+ */ + if (ct->naccess > 2) + ct->naccess = 1; + else if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't update oldest_ts by removed entry */ + continue; + } + } + } + + /* update oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + + /* * ReleaseCatCache * @@ -1960,6 +2267,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); -- 2.23.0 #! /usr/bin/perl $collist = ""; foreach $i (0..1000) { $collist .= sprintf(", c%05d int", $i); } $collist = substr($collist, 2); printf "drop schema if exists test cascade;\n"; printf "create schema test;\n"; printf "create table test.p ($collist) partition by list (c00000);\n"; foreach $i (0..2999) { printf "create table test.t%04d partition of test.p for values in (%d);\n", $i, $i; } #!/bin/bash LOOPS=20 ITERATION=10 BINROOT=/home/horiguti/bin DATADIR=/home/horiguti/data/data_catexpe PREC="numeric(10,2)" /usr/bin/killall postgres /usr/bin/sleep 3 run() { local BINARY=$1 local PGCTL=$2/bin/pg_ctl local PGSQL=$2/bin/postgres local PSQL=$2/bin/psql if [ "$3" != "" ]; then local SETTING1="set catalog_cache_prune_min_age to \"$3\";" local SETTING2="set catalog_cache_prune_min_age to \"$4\";" local SETTING3="set catalog_cache_prune_min_age to \"$5\";" fi # ($PGSQL -D $DATADIR 2>&1 > /dev/null)& ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')& /usr/bin/sleep 3 ${PSQL} postgres <<EOF create extension if not exists catcachebench; select catcachebench(0); $SETTING3 select * from 
generate_series(2, 2) test, LATERAL (select '${BINARY}' as version, count(r)::text || '/${LOOPS}' as n, min(r)::${PREC}, stddev(r)::${PREC} from (select catcachebench(test) as r from generate_series(1, ${LOOPS})) r) r EOF $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /' # oreport > $BINARY_perf.txt } for i in $(seq 0 ${ITERATION}); do run "master" $BINROOT/pgsql_master_o2 "" "" "" run "base" $BINROOT/pgsql_catexp-base "" "" "" run "ind" $BINROOT/pgsql_catexp-ind "" "" "" run "expire-off" $BINROOT/pgsql_catexpe "-1" "-1" "-1" run "expire-on" $BINROOT/pgsql_catexpe "300s" "1s" "0" done #! /usr/bin/bash BINROOT=/home/horiguti/bin/pgsql_ for i in master_o2 catexp-base catexp-ind catexpe; do rm -r /home/horiguti/bin/pgsql_$i/*; done for i in master_o2 catexp-base catexp-ind catexpe; do ls -l /home/horiguti/bin/pgsql_$i/bin/postgres; done function build () { echo $1 make distclean git checkout $2 git diff master..HEAD > diff_$1.txt ./configure --enable-debug --enable-tap-tests --enable-nls --with-openssl --with-libxml --with-llvm --prefix=${BINROOT}$1 LLVM_CONFIG="/usr/bin/llvm-config" make -sj8 all make install cd contrib/catcachebench make clean make all make install cd ../.. } build "master_o2" "795e92756cd1" build "catexp-base" "b2ebc9b4f1c" build "catexp-ind" "631a04026d" build "catexpe" "025e5e8a98d" for i in master_o2 catexp-base catexp-ind catexpe; do ls -l /home/horiguti/bin/pgsql_$i/bin/postgres; done
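The pruning logic in CatCacheCleanupOldEntries() above boils down to a small clock-sweep-style aging rule. Here is a standalone, simplified sketch of that per-entry decision (illustrative names and types; this is not the patch code itself):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the per-entry decision CatCacheCleanupOldEntries()
 * makes on each pruning pass.  Names are hypothetical; the real code works
 * on CatCTup and uses TimestampDifference().
 */
typedef struct
{
    unsigned int naccess;    /* clamped access counter (effectively 0..2) */
    long         lastaccess; /* last-access time, in seconds */
    bool         alive;
} Entry;

/* One pruning pass over a single unreferenced entry; true if evicted. */
bool
prune_entry(Entry *e, long now, long prune_min_age)
{
    if (prune_min_age < 0)
        return false;            /* feature disabled */
    if (now - e->lastaccess <= prune_min_age)
        return false;            /* entry is still considered young */

    if (e->naccess > 2)
        e->naccess = 1;          /* heavily accessed: grant two more rounds */
    else if (e->naccess > 0)
        e->naccess--;            /* spend one "life" per pass */
    else
    {
        e->alive = false;        /* unused since the previous pass: evict */
        return true;
    }
    return false;
}
```

With prune_min_age = 300, a never-reaccessed entry is evicted on the first pass that finds it older than 300 seconds, while an entry whose counter reached 2 survives two further passes before eviction, matching the "up to three times the duration" behavior described in the patch comments.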
Hello Kyotaro-san, I see this patch is stuck in WoA since 2019/12/01, although there's a new patch version from 2020/01/14. But the patch seems to no longer apply, at least according to https://commitfest.cputube.org :-( So at this point the status is actually correct. Not sure about the appveyor build (it seems to be about jsonb_set_lax), but on travis it fails like this: catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] so I'll leave it in WoA for now. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
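For readers unfamiliar with the warning quoted above: -Wmissing-prototypes fires when a non-static function is defined without a previous declaration in scope. A minimal sketch (the function name here is illustrative, not from the patch):

```c
#include <assert.h>

/*
 * Minimal illustration of -Wmissing-prototypes (hypothetical function, not
 * from the patch): defining a non-static function with no prior declaration
 * in scope triggers the warning.  The fix is either a prototype -- normally
 * placed in a header -- or a "static" qualifier if the function is local to
 * the file.
 */
int add_one(int x);      /* prototype: with this line present, no warning */

int
add_one(int x)
{
    return x + 1;        /* definition now matches the earlier prototype */
}
```

Either declaring the offending function in a header or marking it static would silence the warning reported for CatalogCacheFlushCatalog2.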
On 2020-Jan-21, Tomas Vondra wrote: > Not sure about the appveyor build (it seems to be about jsonb_set_lax), > but on travis it fails like this: > > catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] Hmm ... travis is running -Werror? That seems overly strict. I think we shouldn't punt a patch because of that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2020-Jan-21, Tomas Vondra wrote: >> Not sure about the appveyor build (it seems to be about jsonb_set_lax), FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should be gone the next time the cfbot tries this. >> but on travis it fails like this: >> catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] > Hmm ... travis is running -Werror? That seems overly strict. I think > we shouldn't punt a patch because of that. Why not? We're not going to allow pushing a patch that throws warnings on common compilers. Or if that does happen, some committer is going to have to spend time cleaning it up. Better to clean it up sooner. (There is, btw, at least one buildfarm animal using -Werror.) regards, tom lane
Hello. At Tue, 21 Jan 2020 14:17:53 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > On 2020-Jan-21, Tomas Vondra wrote: > >> Not sure about the appveyor build (it seems to be about jsonb_set_lax), > > FWIW, I think I fixed jsonb_set_lax yesterday, so that problem should > be gone the next time the cfbot tries this. > > >> but on travis it fails like this: > >> catcache.c:820:1: error: no previous prototype for ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] > > > Hmm ... travis is running -Werror? That seems overly strict. I think > > we shouldn't punt a patch because of that. > > Why not? We're not going to allow pushing a patch that throws warnings > on common compilers. Or if that does happen, some committer is going > to have to spend time cleaning it up. Better to clean it up sooner. > > (There is, btw, at least one buildfarm animal using -Werror.) Mmm. The cause of the error is a tentative (or crude, or brute-force) benchmarking function provided as an extension; it is not actually a part of the patch and was included for reviewers' convenience. However, I don't want it to be built on Windows. If that is regarded as a reason for punting the patch, I'll repost a new version without the benchmark soon. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Jan 21, 2020 at 02:17:53PM -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Hmm ... travis is running -Werror? That seems overly strict. I think >> we shouldn't punt a patch because of that. > > Why not? We're not going to allow pushing a patch that throws warnings > on common compilers. Or if that does happen, some committer is going > to have to spend time cleaning it up. Better to clean it up sooner. > > (There is, btw, at least one buildfarm animal using -Werror.) I agree that it is good to have in Mr Robot. More early detection means less follow-up cleanup. -- Michael
At Tue, 21 Jan 2020 17:29:47 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in > I see this patch is stuck in WoA since 2019/12/01, although there's a > new patch version from 2020/01/14. But the patch seems to no longer > apply, at least according to https://commitfest.cputube.org :-( So at > this point the status is actually correct. > > Not sure about the appveyor build (it seems to be about > jsonb_set_lax), > but on travis it fails like this: > > catcache.c:820:1: error: no previous prototype for > ‘CatalogCacheFlushCatalog2’ [-Werror=missing-prototypes] I changed my mind and attached the benchmark patch as a .txt file, expecting the checkers not to pick it up as a part of the patchset. I have been in precise performance-measurement mode for a long time, but I think it's settled. I'd like to return to normal mode and explain this patch. === Motive of the patch The system cache is a mechanism that accelerates access to system catalogs. Basically, entries in a cache are removed via the invalidation mechanism when the corresponding system catalog entry is removed. On the other hand, the system cache also holds "negative" entries indicating that an object is nonexistent, which accelerates the response for nonexistent objects. But negative cache entries have no chance of removal. In a long-lived session that accepts a wide variety of queries on many objects, the system cache holds cache entries for many objects that are accessed only once or a few times. Suppose every object is accessed once per, say, 30 minutes, and the query doesn't need to run in a very short time. Such cache entries are almost useless but occupy a large amount of memory. === Possible solutions Many caching systems have an expiration mechanism, which removes "useless" entries to keep the size under a certain limit. The limit is typically defined by memory usage or expiration time, in a hard or soft way.
Since we don't implement a detailed accounting of cache memory usage for performance reasons, we can use either coarse memory accounting or expiration time. This patch uses expiration time because it can be determined on a clearer basis. === Pruning timing The next point is when to prune cache entries. Apparently it's not reasonable to do so on every cache access, since pruning takes far longer than a cache access. The system cache is implemented on a hash table. When there's no room for a new cache entry, the table doubles in size and all entries are rehashed. If pruning can make some space for the new entry, rehashing can be avoided, so this patch tries pruning just before enlarging the hash table. A system cache could also be shrunk if less than half of its size is used, but this patch doesn't do that. That is because we cannot predict whether a system cache that has just shrunk is going to be enlarged again right after, and I don't want to make this patch that complex. === Performance The pruning mechanism adds several members to each cache entry and updates them on access. The system cache is very light-weight machinery, so inserting even one branch measurably affects performance. So in this patch, the new stuff is isolated from the existing code path using indirect calls. After trials on several call points that could become indirect calls, I found that SearchCatCache[1-4]() are the only points where this doesn't affect performance. (Please see upthread for details.) That configuration also allows future implementations of system caches, such as shared system caches. The alternative SearchCatCache[1-4] functions get a bit slower because they maintain an access timestamp and an access counter. In addition, pruning adds a certain amount of time even if no entries are pruned off. === Pruning criteria At the pruning time described above, every entry is examined against the GUC variable catalog_cache_prune_min_age. The pruning mechanism involves a clock-sweep-like mechanism where an entry lives longer if it has been accessed.
An entry whose access counter is zero is pruned after catalog_cache_prune_min_age. Otherwise the entry survives the pruning round and its counter is decremented. All timestamps used by this machinery come from "catcacheclock", which is updated at every transaction start. === Concise test The attached test1.pl can be used to replay the syscache bloat caused by negative entries. Setting $prune_age to -1 turns pruning off, and you can see that the backend takes more and more memory without limit as time proceeds. Setting it to 10 or so, the memory usage of the backend process stops growing at a certain amount. === The patch The following attached files are the patches. They have been separated for benchmarking reasons, but that also seems to make the patches easier to read, so I left it that way. I lost track of the correct version number during the long benchmarking period, so I have restarted from v1. - v1-0001-base_change.patch Adds new members to existing structs and catcacheclock-related code. - v1-0002-Make-CatCacheSearchN-indirect-functions.patch Changes SearchCatCacheN functions to be called via indirect calls. - v1-0003-CatCache-expiration-feature.patch The core code of the patch. - catcache-benchmark-extension.patch.txt The benchmarking extension that was used for benchmarking upthread. Just for information. - test1.pl Test script to make the syscache bloat. The patchset doesn't contain documentation for the new GUC option. I will add it later. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From fb80260907ac4ac0ff330806632f095484772fd1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 10 Jan 2020 15:02:26 +0900 Subject: [PATCH v1 1/3] base_change Adds struct members needed by the catcache expiration feature and a GUC variable that controls the behavior of the feature, but no substantial code is added yet. If the existence of these variables alone causes degradation, benchmarking after this patch will show it.
--- src/backend/access/transam/xact.c | 3 +++ src/backend/utils/cache/catcache.c | 15 +++++++++++++++ src/backend/utils/misc/guc.c | 13 +++++++++++++ src/include/utils/catcache.h | 20 ++++++++++++++++++++ 4 files changed, 51 insertions(+) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 017f03b6d8..1268a7fb80 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1067,6 +1067,9 @@ ForceSyncCommit(void) static void AtStart_Cache(void) { + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); + AcceptInvalidationMessages(); } diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 64776e3209..7248bd0d41 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -60,9 +60,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = 300; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -99,6 +108,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} /* * internal support functions diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index e44f71e991..3029e44d7a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -83,6 +83,8 @@ #include "tsearch/ts_cache.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" +#include "utils/guc_tables.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -2293,6 +2295,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that live unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + 300, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* * We use the hopefully-safely-small value of 100kB as the compiled-in * default for max_stack_depth.
InitializeGUCOptions will increase it if diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f4aa316604..3d3870f05a 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + TimestampTz cc_oldest_ts; /* timestamp of the oldest tuple in the hash */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + unsigned int naccess; /* # of access to this entry */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.23.0 From 2b4449372acfbdf728a79f43dec0a0109c30228d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 9 Jan 2020 19:22:18 +0900 Subject: [PATCH v1 2/3] Make CatCacheSearchN indirect functions Some experiments showed that the best way to add a new feature to the current CatCacheSearch path is to make the SearchCatCacheN functions replaceable via indirect calls. This patch does that. If the change in how the functions are called causes degradation by itself, benchmarking after this patch is applied will show it.
--- src/backend/utils/cache/catcache.c | 42 +++++++++++++++++++++++------- src/include/utils/catcache.h | 40 ++++++++++++++++++++++++---- 2 files changed, 67 insertions(+), 15 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 7248bd0d41..a4e3676a89 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -84,6 +84,15 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCacheb(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1); +static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum v2); +static HeapTuple SearchCatCache3b(CatCache *cache, + Datum v1, Datum v2, Datum v3); +static HeapTuple SearchCatCache4b(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, @@ -108,6 +117,16 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +static SearchCatCacheFuncsType catcache_base = { + SearchCatCacheb, + SearchCatCache1b, + SearchCatCache2b, + SearchCatCache3b, + SearchCatCache4b +}; + +SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL; + /* GUC assign function */ void assign_catalog_cache_prune_min_age(int newval, void *extra) @@ -817,6 +836,9 @@ InitCatCache(int id, CacheHdr = (CatCacheHeader *) palloc(sizeof(CatCacheHeader)); slist_init(&CacheHdr->ch_caches); CacheHdr->ch_ntup = 0; + + SearchCatCacheFuncs = &catcache_base; + #ifdef CATCACHE_STATS /* set up to dump stats at backend exit */ on_proc_exit(CatCachePrintStats, 0); @@ -1158,8 +1180,8 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey) * the 
caller need not go to the trouble of converting it to a fully * null-padded NAME. */ -HeapTuple -SearchCatCache(CatCache *cache, +static HeapTuple +SearchCatCacheb(CatCache *cache, Datum v1, Datum v2, Datum v3, @@ -1175,32 +1197,32 @@ SearchCatCache(CatCache *cache, * bit faster than SearchCatCache(). */ -HeapTuple -SearchCatCache1(CatCache *cache, +static HeapTuple +SearchCatCache1b(CatCache *cache, Datum v1) { return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0); } -HeapTuple -SearchCatCache2(CatCache *cache, +static HeapTuple +SearchCatCache2b(CatCache *cache, Datum v1, Datum v2) { return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0); } -HeapTuple -SearchCatCache3(CatCache *cache, +static HeapTuple +SearchCatCache3b(CatCache *cache, Datum v1, Datum v2, Datum v3) { return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0); } -HeapTuple -SearchCatCache4(CatCache *cache, +static HeapTuple +SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4) { return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4); diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 3d3870f05a..f9e9889339 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -189,6 +189,36 @@ typedef struct catcacheheader int ch_ntup; /* # of tuples in all caches */ } CatCacheHeader; +typedef HeapTuple (*SearchCatCache_fn)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +typedef HeapTuple (*SearchCatCache1_fn)(CatCache *cache, Datum v1); +typedef HeapTuple (*SearchCatCache2_fn)(CatCache *cache, Datum v1, Datum v2); +typedef HeapTuple (*SearchCatCache3_fn)(CatCache *cache, Datum v1, Datum v2, + Datum v3); +typedef HeapTuple (*SearchCatCache4_fn)(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + +typedef struct SearchCatCacheFuncsType +{ + SearchCatCache_fn SearchCatCache; + SearchCatCache1_fn SearchCatCache1; + SearchCatCache2_fn SearchCatCache2; + SearchCatCache3_fn SearchCatCache3; + SearchCatCache4_fn SearchCatCache4; 
+} SearchCatCacheFuncsType; + +extern PGDLLIMPORT SearchCatCacheFuncsType *SearchCatCacheFuncs; + +#define SearchCatCache(cache, v1, v2, v3, v4) \ + SearchCatCacheFuncs->SearchCatCache(cache, v1, v2, v3, v4) +#define SearchCatCache1(cache, v1) \ + SearchCatCacheFuncs->SearchCatCache1(cache, v1) +#define SearchCatCache2(cache, v1, v2) \ + SearchCatCacheFuncs->SearchCatCache2(cache, v1, v2) +#define SearchCatCache3(cache, v1, v2, v3) \ + SearchCatCacheFuncs->SearchCatCache3(cache, v1, v2, v3) +#define SearchCatCache4(cache, v1, v2, v3, v4) \ + SearchCatCacheFuncs->SearchCatCache4(cache, v1, v2, v3, v4) /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; @@ -216,15 +246,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, int nbuckets); extern void InitCatCachePhase2(CatCache *cache, bool touch_index); -extern HeapTuple SearchCatCache(CatCache *cache, +extern HeapTuple (*SearchCatCache)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); -extern HeapTuple SearchCatCache1(CatCache *cache, +extern HeapTuple (*SearchCatCache1)(CatCache *cache, Datum v1); -extern HeapTuple SearchCatCache2(CatCache *cache, +extern HeapTuple (*SearchCatCache2)(CatCache *cache, Datum v1, Datum v2); -extern HeapTuple SearchCatCache3(CatCache *cache, +extern HeapTuple (*SearchCatCache3)(CatCache *cache, Datum v1, Datum v2, Datum v3); -extern HeapTuple SearchCatCache4(CatCache *cache, +extern HeapTuple (*SearchCatCache4)(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); extern void ReleaseCatCache(HeapTuple tuple); -- 2.23.0 From e348af29c3cae63212dcb3d982e419e53bc86517 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 10 Jan 2020 15:08:54 +0900 Subject: [PATCH v1 3/3] CatCache expiration feature. This adds the catcache expiration feature to the catcache mechanism. 
The current catcache doesn't remove entries, and there are cases where many hash entries occupy a large amount of memory without ever being accessed again. This can be a quite serious issue for long-running sessions. The expiration feature keeps process memory usage below a certain amount, in exchange for some performance degradation when it is turned on. --- src/backend/utils/cache/catcache.c | 343 +++++++++++++++++++++++++++-- 1 file changed, 326 insertions(+), 17 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index a4e3676a89..29bc980d8e 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -72,10 +72,11 @@ static CatCacheHeader *CacheHdr = NULL; /* Clock for the last accessed time of a catcache entry. */ TimestampTz catcacheclock = 0; -static inline HeapTuple SearchCatCacheInternal(CatCache *cache, - int nkeys, - Datum v1, Datum v2, - Datum v3, Datum v4); +/* basic catcache search functions */ +static inline HeapTuple SearchCatCacheInternalb(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, int nkeys, @@ -93,6 +94,23 @@ static HeapTuple SearchCatCache3b(CatCache *cache, static HeapTuple SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4); +/* catcache search functions with expiration feature */ +static inline HeapTuple SearchCatCacheInternale(CatCache *cache, + int nkeys, + Datum v1, Datum v2, + Datum v3, Datum v4); + +static HeapTuple SearchCatCachee(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); +static HeapTuple SearchCatCache1e(CatCache *cache, Datum v1); +static HeapTuple SearchCatCache2e(CatCache *cache, Datum v1, Datum v2); +static HeapTuple SearchCatCache3e(CatCache *cache, + Datum v1, Datum v2, Datum v3); +static HeapTuple SearchCatCache4e(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4); + +static bool CatCacheCleanupOldEntries(CatCache
*cp); + static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, @@ -125,13 +143,35 @@ static SearchCatCacheFuncsType catcache_base = { SearchCatCache4b }; +static SearchCatCacheFuncsType catcache_expire = { + SearchCatCachee, + SearchCatCache1e, + SearchCatCache2e, + SearchCatCache3e, + SearchCatCache4e +}; + SearchCatCacheFuncsType *SearchCatCacheFuncs = NULL; +/* set catcache function set according to guc variables */ +static void +set_catcache_functions(void) +{ + if (catalog_cache_prune_min_age < 0) + SearchCatCacheFuncs = &catcache_base; + else + SearchCatCacheFuncs = &catcache_expire; +} + + /* GUC assign function */ void assign_catalog_cache_prune_min_age(int newval, void *extra) { catalog_cache_prune_min_age = newval; + + /* choose corresponding function set */ + set_catcache_functions(); } /* @@ -837,7 +877,7 @@ InitCatCache(int id, slist_init(&CacheHdr->ch_caches); CacheHdr->ch_ntup = 0; - SearchCatCacheFuncs = &catcache_base; + set_catcache_functions(); #ifdef CATCACHE_STATS /* set up to dump stats at backend exit */ @@ -900,6 +940,10 @@ RehashCatCache(CatCache *cp) int newnbuckets; int i; + /* try removing old entries before expanding hash */ + if (CatCacheCleanupOldEntries(cp)) + return; + elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); @@ -1187,7 +1231,7 @@ SearchCatCacheb(CatCache *cache, Datum v3, Datum v4) { - return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4); + return SearchCatCacheInternalb(cache, cache->cc_nkeys, v1, v2, v3, v4); } @@ -1201,7 +1245,7 @@ static HeapTuple SearchCatCache1b(CatCache *cache, Datum v1) { - return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0); + return SearchCatCacheInternalb(cache, 1, v1, 0, 0, 0); } @@ -1209,7 +1253,7 @@ static HeapTuple SearchCatCache2b(CatCache *cache, Datum v1, Datum 
v2) { - return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0); + return SearchCatCacheInternalb(cache, 2, v1, v2, 0, 0); } @@ -1217,7 +1261,7 @@ static HeapTuple SearchCatCache3b(CatCache *cache, Datum v1, Datum v2, Datum v3) { - return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0); + return SearchCatCacheInternalb(cache, 3, v1, v2, v3, 0); } @@ -1225,19 +1269,19 @@ static HeapTuple SearchCatCache4b(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4) { - return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4); + return SearchCatCacheInternalb(cache, 4, v1, v2, v3, v4); } /* - * Work-horse for SearchCatCache/SearchCatCacheN. + * Work-horse for SearchCatCacheb/SearchCatCacheNb. */ static inline HeapTuple -SearchCatCacheInternal(CatCache *cache, - int nkeys, - Datum v1, - Datum v2, - Datum v3, - Datum v4) +SearchCatCacheInternalb(CatCache *cache, + int nkeys, + Datum v1, + Datum v2, + Datum v3, + Datum v4) { Datum arguments[CATCACHE_MAXKEYS]; uint32 hashValue; @@ -1462,6 +1506,269 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * SearchCatCache with entry pruning + * + * These functions work the same way as the SearchCatCacheNb() functions + * except that less-used entries are removed according to the + * catalog_cache_prune_min_age setting. + */ +static HeapTuple +SearchCatCachee(CatCache *cache, + Datum v1, + Datum v2, + Datum v3, + Datum v4) +{ + return SearchCatCacheInternale(cache, cache->cc_nkeys, v1, v2, v3, v4); +} + + +/* + * SearchCatCacheN() are SearchCatCache() versions for a specific number of + * arguments. The compiler can inline the body and unroll loops, making them a + * bit faster than SearchCatCache().
+ */ + +static HeapTuple +SearchCatCache1e(CatCache *cache, + Datum v1) +{ + return SearchCatCacheInternale(cache, 1, v1, 0, 0, 0); +} + + +static HeapTuple +SearchCatCache2e(CatCache *cache, + Datum v1, Datum v2) +{ + return SearchCatCacheInternale(cache, 2, v1, v2, 0, 0); +} + + +static HeapTuple +SearchCatCache3e(CatCache *cache, + Datum v1, Datum v2, Datum v3) +{ + return SearchCatCacheInternale(cache, 3, v1, v2, v3, 0); +} + + +static HeapTuple +SearchCatCache4e(CatCache *cache, + Datum v1, Datum v2, Datum v3, Datum v4) +{ + return SearchCatCacheInternale(cache, 4, v1, v2, v3, v4); +} + +/* + * Work-horse for SearchCatCachee/SearchCatCacheNe. + */ +static inline HeapTuple +SearchCatCacheInternale(CatCache *cache, + int nkeys, + Datum v1, + Datum v2, + Datum v3, + Datum v4) +{ + Datum arguments[CATCACHE_MAXKEYS]; + uint32 hashValue; + Index hashIndex; + dlist_iter iter; + dlist_head *bucket; + CatCTup *ct; + + /* Make sure we're in an xact, even if this ends up being a cache hit */ + Assert(IsTransactionState()); + + Assert(cache->cc_nkeys == nkeys); + + /* + * one-time startup overhead for each cache + */ + if (unlikely(cache->cc_tupdesc == NULL)) + CatalogCacheInitializeCache(cache); + +#ifdef CATCACHE_STATS + cache->cc_searches++; +#endif + + /* Initialize local parameter array */ + arguments[0] = v1; + arguments[1] = v2; + arguments[2] = v3; + arguments[3] = v4; + + /* + * find the hash bucket in which to look for the tuple + */ + hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4); + hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets); + + /* + * scan the hash bucket until we find a match or exhaust our tuples + * + * Note: it's okay to use dlist_foreach here, even though we modify the + * dlist within the loop, because we don't continue the loop afterwards. 
+ */ + bucket = &cache->cc_bucket[hashIndex]; + dlist_foreach(iter, bucket) + { + ct = dlist_container(CatCTup, cache_elem, iter.cur); + + if (ct->dead) + continue; /* ignore dead entries */ + + if (ct->hash_value != hashValue) + continue; /* quickly skip entry if wrong hash val */ + + if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments)) + continue; + + /* + * We found a match in the cache. Move it to the front of the list + * for its hashbucket, in order to speed subsequent searches. (The + * most frequently accessed elements in any hashbucket will tend to be + * near the front of the hashbucket's list.) + */ + dlist_move_head(bucket, &ct->cache_elem); + + /* + * Prolong the life of this entry. Since we want to run as few + * instructions as possible and want the branch to be stable for + * performance reasons, we don't put a strict cap on the counter. All + * numbers above 1 will be regarded as 2 in + * CatCacheCleanupOldEntries(). + */ + ct->naccess++; + if (unlikely(ct->naccess == 0)) + ct->naccess = 2; + ct->lastaccess = catcacheclock; + + /* + * If it's a positive entry, bump its refcount and return it. If it's + * negative, we can report failure to the caller. + */ + if (!ct->negative) + { + ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner); + ct->refcount++; + ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple); + + CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_hits++; +#endif + + return &ct->tuple; + } + else + { + CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d", + cache->cc_relname, hashIndex); + +#ifdef CATCACHE_STATS + cache->cc_neg_hits++; +#endif + + return NULL; + } + } + + return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4); +} + +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons.
Remove such entries to prevent the catcache from bloating. It is based + * on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (catalog_cache_prune_min_age < 0) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a + * per-statement basis and additionally updated periodically + * during a long-running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age > catalog_cache_prune_min_age) + { + /* + * Entries that are not accessed since the last pruning + * are removed after that many seconds, and their + * lifetimes are prolonged according to how many times + * they have been accessed, up to three times that + * duration. We don't try to shrink the buckets since + * pruning effectively caps catcache expansion in the + * long term.
+ */ + if (ct->naccess > 2) + ct->naccess = 1; + else if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't update oldest_ts for removed entries */ + continue; + } + } + } + + /* update oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + + /* * ReleaseCatCache * @@ -1925,6 +2232,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); -- 2.23.0 From 7a793b13803d5defd6d5154a075d1d4cb6826103 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 14 Nov 2019 19:24:36 +0900 Subject: [PATCH] catcache benchmark extension Provides the function catcachebench(bench_no int), which runs a CPU-intensive benchmark on catcache search. The test tables are created by a separately provided script. catcachebench(0): prewarm catcache with provided test tables. catcachebench(1): fetches all attribute stats of all tables. This benchmark loads a vast number of unique entries. Expiration doesn't work since it runs in a transaction. catcachebench(2): fetches all attribute stats of a table many times. This benchmark repeatedly accesses already-loaded entries. Expiration doesn't work since it runs in a transaction. catcachebench(3): fetches all attribute stats of all tables four times. Different from the other modes, this exercises expiration by forcibly updating the reference clock variable every 1000 entries.
At this point, the variables needed for the expiration feature are not added, so SetCatCacheClock is a dummy macro that just expands to its parameter. --- contrib/catcachebench/Makefile | 17 + contrib/catcachebench/catcachebench--0.0.sql | 14 + contrib/catcachebench/catcachebench.c | 330 +++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 35 ++ src/backend/utils/cache/syscache.c | 2 +- src/include/utils/catcache.h | 3 + 7 files changed, 406 insertions(+), 1 deletion(-) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..ea9cd62abb --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,14 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file.
\quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..b6c2b8f577 --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,330 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + +Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + 
state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. */ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm the syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entries of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but..
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of a table many times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entries of all tables several times while making + * expiration happen. + */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by the transaction timestamp, so it + * needs to be updated by other means for this test to work. Here + * I chose to update the clock every 1000 table scans. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but..
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relations found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many attributes found in table %d", tableoids[0]); + + /* collect target attributes.
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 0c68c04caa..35e1a07e57 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -814,6 +814,41 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + ereport(DEBUG1, + (errmsg("Catcache reset"), + errhidestmt(true))); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 53d9ddf159..1c79a85a8c 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -983,7 +983,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false; diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f9e9889339..dc0ad1a268 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -278,4 +278,7 @@ extern void PrepareToInvalidateCacheTuple(Relation relation, extern void PrintCatCacheLeakWarning(HeapTuple tuple); extern void PrintCatCacheListLeakWarning(CatCList *list); +/* tentative change to allow benchmark on master branch */ +#define SetCatCacheClock(ts) (ts) + #endif /* CATCACHE_H */ -- 2.23.0 #! 
/usr/bin/perl use Expect; $prune_age = -1; $warmup_secs = 10; $interval = 10; my $exp = Expect->spawn("psql", "postgres") or die "cannot execute psql: $!\n"; $exp->log_stdout(0); #print $exp "set track_catalog_cache_usage_interval to 1000;\n"; #$exp->expect(10, "postgres=#"); print $exp "set catalog_cache_prune_min_age to '${prune_age}s';\n"; $exp->expect(10, "postgres=#"); $starttime = time(); $count = 0; $mean = 0; $nexttime = $starttime + $warmup_secs; $firsttime = 1; while (1) { print $exp "begin; create temp table t1 (a int, b int, c int, d int, e int, f int, g int, h int, i int, j int) on commit drop; insert into t1 values (1, 2, 3, 4, 5, 6, 7, 8, 9, 10); select * from t1; commit;\n"; $exp->expect(10, "postgres=#"); $count++; if (time() > $nexttime) { if ($firsttime) { $count = 0; $firsttime = 0; } elsif ($mean == 0) { $mean = $count; } else { $mean = $mean * 0.9 + $count * 0.1; } printf STDERR "%6d : %9.2f\n", $count, $mean if ($mean > 0); $count = 0; $nexttime += $interval; } }
On 19/11/2019 12:48, Kyotaro Horiguchi wrote: > 1. Inserting a branch in SearchCatCacheInternal. (CatCache_Pattern_1.patch) > > This is the most straightforward way to add an alternative feature. > > pattern 1 | 8459.73 | 28.15 # 9% (>> 1%) slower than 7757.58 > pattern 1 | 8504.83 | 55.61 > pattern 1 | 8541.81 | 41.56 > pattern 1 | 8552.20 | 27.99 > master | 7757.58 | 22.65 > master | 7801.32 | 20.64 > master | 7839.57 | 25.28 > master | 7925.30 | 38.84 > > It's so slow that it cannot be used. This is very surprising. A branch that's never taken ought to be predicted by the CPU's branch-predictor, and be very cheap. Do we actually need a branch there? If I understand correctly, the point is to bump up a usage counter on the catcache entry. You could increment the counter unconditionally, even if the feature is not used, and avoid the branch that way. Another thought is to bump up the usage counter in ReleaseCatCache(), and only when the refcount reaches zero. That might be somewhat cheaper, if it's a common pattern to acquire additional leases on an entry that's already referenced. Yet another thought is to replace 'refcount' with an 'acquirecount' and 'releasecount'. In SearchCatCacheInternal(), increment acquirecount, and in ReleaseCatCache, increment releasecount. When they are equal, the entry is not in use. Now you have a counter that gets incremented on every access, with the same number of CPU instructions in the hot paths as we have today. Or maybe there are some other ways we could micro-optimize SearchCatCacheInternal(), to buy back the slowdown that this feature would add? For example, you could remove the "if (cl->dead) continue;" check, if dead entries were kept out of the hash buckets. Or maybe the catctup struct could be made slightly smaller somehow, so that it would fit more comfortably in a single cache line. My point is that I don't think we want to complicate the code much for this. All the indirection stuff seems over-engineered for this. 
Let's find a way to keep it simple. - Heikki
Thank you for the comment! First off, I thought that I had managed to eliminate the degradation observed on the previous versions, but significant degradation (1.1% slower) is still seen in one case. Anyway, before sending the new patch, let me just answer the comments. At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 19/11/2019 12:48, Kyotaro Horiguchi wrote: > > 1. Inserting a branch in > > SearchCatCacheInternal. (CatCache_Pattern_1.patch) > > This is the most straightforward way to add an alternative feature. > > pattern 1 | 8459.73 | 28.15 # 9% (>> 1%) slower than 7757.58 > > pattern 1 | 8504.83 | 55.61 > > pattern 1 | 8541.81 | 41.56 > > pattern 1 | 8552.20 | 27.99 > > master | 7757.58 | 22.65 > > master | 7801.32 | 20.64 > > master | 7839.57 | 25.28 > > master | 7925.30 | 38.84 > > It's so slow that it cannot be used. > > This is very surprising. A branch that's never taken ought to be > predicted by the CPU's branch-predictor, and be very cheap. (A) original test patch I naively thought that the code path was too short to absorb the degradation from a few additional instructions. Actually I measured performance again with the same patch set on the current master and got more or less the same result. master 8195.58ms, patched 8817.40 ms: +10.75% However, I noticed that the additional call was a recursive call and the jmp inserted for the recursive call seems to take significant time. After avoiding the recursive call, the difference was reduced to +0.96% (master 8268.71ms : patched 8348.30ms) Just the two instructions below are inserted in this case, which looks reasonable. 8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48 <catalog_cache_prune_min_age> 872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function) (C) inserting bare counter-update code without a branch > Do we actually need a branch there? If I understand correctly, the > point is to bump up a usage counter on the catcache entry.
You could > increment the counter unconditionally, even if the feature is not > used, and avoid the branch that way. That change causes 4.9% degradation, which is worse than having a branch. master 8364.54ms, patched 8666.86ms (+4.9%) The additional instructions follow. + 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess + 8721ae <+206>: mov $0x2,%edx + 8721b3 <+211>: add $0x1,%eax # %eax++ + 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2 <original code> + 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax + 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock> + 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax (D) naively branching then updating, again. Come to think of it, I measured the same with a branch again, specifically: (It showed significant degradation before, in my memory.) dlist_move_head(bucket, &ct->cache_elem); + if (catalog_cache_prune_min_age < -1) # never be true + { + (counter update) + } And I had effectively the same numbers from both master and patched. master 8066.93ms, patched 8052.37ms (-0.18%) The above branching inserts the same two instructions as (B), in a different place, but the result differs for a reason uncertain to me. + 8721bb <+203>: cmpl $0xffffffff,0x4bb886(%rip) # <catalog_cache_prune_min_age> + 8721c2 <+210>: jl 0x872208 <SearchCatCache1+280> I'm not sure why, but the patched version beats the master by a small difference. Anyway, this new result suggests that the compiler might have gotten smarter than before? (E) bumping up in ReleaseCatCache() (won't work) > Another thought is to bump up the usage counter in ReleaseCatCache(), > and only when the refcount reaches zero. That might be somewhat > cheaper, if it's a common pattern to acquire additional leases on an > entry that's already referenced. > > Yet another thought is to replace 'refcount' with an 'acquirecount' > and 'releasecount'. In SearchCatCacheInternal(), increment > acquirecount, and in ReleaseCatCache, increment releasecount.
When > they are equal, the entry is not in use. Now you have a counter that > gets incremented on every access, with the same number of CPU > instructions in the hot paths as we have today. These don't work for negative caches, since the corresponding tuples are never released. (F) removing less-significant code. > Or maybe there are some other ways we could micro-optimize > SearchCatCacheInternal(), to buy back the slowdown that this feature Yeah, I thought of that in the beginning. (I removed dlist_move_head() at the time.) But the most difficult aspect of this approach is that I cannot tell whether the modification causes degradation or not. > would add? For example, you could remove the "if (cl->dead) continue;" > check, if dead entries were kept out of the hash buckets. Or maybe the > catctup struct could be made slightly smaller somehow, so that it > would fit more comfortably in a single cache line. As a trial, I removed that code and added the ct->naccess code. master 8187.44ms, patched 8266.74ms (+1.0%) So the removal decreased the degradation by about 3.9% of the total time. > My point is that I don't think we want to complicate the code much for > this. All the indirection stuff seems over-engineered for this. Let's > find a way to keep it simple. Yes, agreed from the bottom of my heart. I aspire to find a simple way to avoid degradation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
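For readers following along, Heikki's acquirecount/releasecount idea quoted above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual catcache code; the struct and function names here are invented for the example:

```c
#include <assert.h>

/*
 * Sketch of the acquirecount/releasecount scheme: count acquisitions
 * and releases separately instead of keeping a single refcount.  The
 * entry is unpinned when the two counters are equal, and acquirecount
 * doubles as an ever-growing usage counter for eviction decisions.
 * (As noted above, this doesn't help negative entries, which are never
 * acquired/released through this path.)
 */
typedef struct SketchEntry
{
	unsigned int acquirecount;	/* bumped on every cache hit */
	unsigned int releasecount;	/* bumped on every release */
} SketchEntry;

static void
sketch_acquire(SketchEntry *e)
{
	e->acquirecount++;			/* same hot-path cost as refcount++ */
}

static void
sketch_release(SketchEntry *e)
{
	e->releasecount++;
}

/* entry is pinned while acquires outnumber releases */
static int
sketch_in_use(const SketchEntry *e)
{
	return e->acquirecount != e->releasecount;
}
```

The appeal of the scheme is that the hot paths execute the same number of instructions as today's refcount++/refcount--, while acquirecount provides an access count to the pruning side for free.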
me> First off, I thought that I managed to eliminate the degradation me> observed on the previous versions, but significant degradation (1.1% me> slower) is still seen in one case. While trying benchmarking with many patterns, I noticed that calling CatCacheCleanupOldEntries() slows down catcache search significantly, even if the function does almost nothing. Oddly enough, the degradation got larger when I removed the counter-updating code from SearchCatCacheInternal. It seems that RehashCatCache is called far more frequently than I thought and CatCacheCleanupOldEntries was suffering the branch penalty. The degradation vanished once a likely() was attached to the condition. On the contrary, the patched version is consistently slightly faster than master. For now, I measured the patch with three access patterns as the catcachebench was designed. master patched-off patched-on(300s) test 1 3898.18ms 3896.11ms (-0.1%) 3889.44ms (- 0.2%) test 2 8013.37ms 8098.51ms (+1.1%) 8640.63ms (+ 7.8%) test 3 6146.95ms 6147.91ms (+0.0%) 15466 ms (+152 %) master : This patch is not applied. patched-off: This patch is applied and catalog_cache_prune_min_age = -1 patched-on : This patch is applied and catalog_cache_prune_min_age = 0 test 1: Creates many negative entries in STATRELATTINH (expiration doesn't happen) test 2: Repeatedly fetches several negative entries many times. test 3: test 1, with expiration happening. The result looks far better, but test 2 still shows a small degradation... I'll continue investigating it.. regards.
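The likely() fix described above can be illustrated with a standalone sketch. The macro definitions mirror the usual __builtin_expect pattern (PostgreSQL defines equivalent macros in c.h); the function below is a simplified, hypothetical stand-in for CatCacheCleanupOldEntries(), not the patch's actual code:

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif

static int	catalog_cache_prune_min_age = -1;	/* -1 disables pruning */

/*
 * Simplified stand-in for CatCacheCleanupOldEntries().  It is reached
 * from the (surprisingly hot) RehashCatCache() path, so the "feature
 * disabled" early exit is annotated with likely() to tell the compiler
 * to lay out the straight-line code for that common case.
 */
static int
cleanup_old_entries_sketch(void)
{
	if (likely(catalog_cache_prune_min_age < 0))
		return 0;				/* common case: expiration disabled */

	/* ... scan the hash buckets and prune stale entries here ... */
	return 0;
}
```

With the hint, the disabled-feature case falls through without a taken branch, which is what removed the penalty described in the message above.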
-- Kyotaro Horiguchi NTT Open Source Software Center From 9516267f0e2943cf955cbbfe5133c13c36288ee6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Fri, 6 Nov 2020 17:27:18 +0900 Subject: [PATCH v4] CatCache expiration feature --- src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 125 +++++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/include/utils/catcache.h | 20 +++++ 4 files changed, 160 insertions(+) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index af6afcebb1..a246fcc4c0 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1086,6 +1086,9 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 3613ae5f44..f63224bfd5 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4); +static bool CatCacheCleanupOldEntries(CatCache *cp); static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); @@ -99,6 +110,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} /* * internal support functions @@ -863,6 +880,10 @@ RehashCatCache(CatCache *cp) int newnbuckets; int i; + /* try removing old entries before expanding hash */ + if (CatCacheCleanupOldEntries(cp)) + return; + elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); @@ -1264,6 +1285,20 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* + * Prolong the life of this entry. Since we want to run as few + * instructions as possible and want the branch to be stable for + * performance reasons, we don't give a strict cap on the counter. All + * numbers above 1 will be regarded as 2 in CatCacheCleanupOldEntries(). + */ + if (unlikely(catalog_cache_prune_min_age >= 0)) + { + ct->naccess++; + if (unlikely(ct->naccess == 0)) + ct->naccess = 2; + ct->lastaccess = catcacheclock; + } + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller.
@@ -1425,6 +1460,94 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent the catcache from bloating. It is + * based on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (likely(catalog_cache_prune_min_age < 0)) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access to the + * "current" time. catcacheclock is updated on a + * per-statement basis and additionally updated periodically + * during a long running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age > catalog_cache_prune_min_age) + { + /* + * Entries that are not accessed after the last pruning + * are removed after that many seconds, and their lives + * are prolonged according to how many times they are + * accessed, up to three times the duration. We don't try + * to shrink buckets since pruning effectively caps + * catcache expansion in the long term.
+ */ + if (ct->naccess > 2) + ct->naccess = 1; + else if (ct->naccess > 0) + ct->naccess--; + else + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't update oldest_ts by removed entry */ + continue; + } + } + } + + /* update oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * ReleaseCatCache * @@ -1888,6 +2011,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 0; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index a62d64eaa4..ca897cab2e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -88,6 +88,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] = check_huge_page_size, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that have been unused for longer than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + -1, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f4aa316604..a11736f767 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h
@@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + TimestampTz cc_oldest_ts; /* timestamp of the oldest tuple in the hash */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + unsigned int naccess; /* # of accesses to this entry */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; + +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.18.4
On 06/11/2020 10:24, Kyotaro Horiguchi wrote: > Thank you for the comment! > > First off, I thought that I managed to eliminate the degradation > observed on the previous versions, but significant degradation (1.1% > slower) is still seen in one case. One thing to keep in mind with micro-benchmarks like this is that even completely unrelated code changes can change the layout of the code in memory, which in turn can affect CPU caching effects in surprising ways. If you're lucky, you can see 1-5% differences just by adding a function that's never called, for example, if it happens to move other code in memory so that some hot codepath or struct gets split across CPU cache lines. It can be infuriating when benchmarking. > At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > (A) original test patch > > I naively thought that the code path was too short for a few additional > instructions to cause noticeable degradation. Actually I measured > performance again with the same patch set on the current master and > got more or less the same result. > > master 8195.58ms, patched 8817.40 ms: +10.75% > > However, I noticed that the additional call was a recursive call and a > jmp inserted for the recursive call seems to take significant > time. After avoiding the recursive call, the difference was reduced to > +0.96% (master 8268.71ms : patched 8348.30ms) > > Just the two instructions below are inserted in this case, which looks > reasonable. > > 8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48 <catalog_cache_prune_min_age> > 872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function) That's interesting. I think a 1% degradation would be acceptable. I think we'd like to enable this feature by default though, so the performance when it's enabled is also very important. > (C) inserting bare counter-update code without a branch > >> Do we actually need a branch there?
If I understand correctly, the >> point is to bump up a usage counter on the catcache entry. You could >> increment the counter unconditionally, even if the feature is not >> used, and avoid the branch that way. > > That change causes 4.9% degradation, which is worse than having a > branch. > > master 8364.54ms, patched 8666.86ms (+4.9%) > > The additional instructions follow. > > + 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess > + 8721ae <+206>: mov $0x2,%edx > + 8721b3 <+211>: add $0x1,%eax # %eax++ > + 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2 > <original code> > + 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax > + 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock> > + 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax Do you need the "ntaccess == 2" test? You could always increment the counter, and in the code that uses ntaccess to decide what to evict, treat all values >= 2 the same. Need to handle integer overflow somehow. Or maybe not: integer overflow is so infrequent that even if a hot syscache entry gets evicted prematurely because its ntaccess count wrapped around to 0, it will happen so rarely that it won't make any difference in practice. - Heikki
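Heikki's suggestion above — increment unconditionally on the hot path, tolerate wrap-around, and clamp only on the eviction side — can be sketched as two helpers. These are illustrative stand-ins for operations on the patch's CatCTup.naccess field, not the actual implementation:

```c
#include <assert.h>
#include <limits.h>

/* hot path: one add, no compare, no conditional move */
static unsigned int
bump_naccess(unsigned int naccess)
{
	return naccess + 1;			/* unsigned wrap-around is harmless */
}

/*
 * eviction side: treat every value >= 2 the same, so the hot path
 * needs no saturation logic.  A counter that wraps all the way back
 * to 0 merely makes one hot entry look cold once, which happens
 * rarely enough not to matter in practice.
 */
static unsigned int
clamp_naccess(unsigned int naccess)
{
	return naccess < 2 ? naccess : 2;
}
```

The point of the split is that the saturation test (the cmove seen in the disassembly quoted above) moves off the search path entirely, into the much colder pruning path.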
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 06/11/2020 10:24, Kyotaro Horiguchi wrote: > > Thank you for the comment! > > First off, I thought that I managed to eliminate the degradation > > observed on the previous versions, but significant degradation (1.1% > > slower) is still seen in one case. > > One thing to keep in mind with micro-benchmarks like this is that even > completely unrelated code changes can change the layout of the code in > memory, which in turn can affect CPU caching effects in surprising > ways. If you're lucky, you can see 1-5% differences just by adding a > function that's never called, for example, if it happens to move other > code in memory so that some hot codepath or struct gets split across > CPU cache lines. It can be infuriating when benchmarking. True. I sometimes had to run make distclean to stabilize such benchmarks.. > > At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas > > <hlinnaka@iki.fi> wrote in > > (A) original test patch > > I naively thought that the code path was too short for a few additional > > instructions to cause noticeable degradation. Actually I measured > > performance again with the same patch set on the current master and > > got more or less the same result. > > master 8195.58ms, patched 8817.40 ms: +10.75% > > However, I noticed that the additional call was a recursive call and a > > jmp inserted for the recursive call seems to take significant > > time. After avoiding the recursive call, the difference was reduced to > > +0.96% (master 8268.71ms : patched 8348.30ms) > > Just the two instructions below are inserted in this case, which looks > > reasonable. > > 8720ff <+31>: cmpl $0xffffffff,0x4ba942(%rip) # 0xd2ca48 > > <catalog_cache_prune_min_age> > > 872106 <+38>: jl 0x872240 <SearchCatCache1+352> (call to a function) > > That's interesting. I think a 1% degradation would be acceptable.
> > I think we'd like to enable this feature by default though, so the > performance when it's enabled is also very important. > > > (C) inserting bare counter-update code without a branch > > > >> Do we actually need a branch there? If I understand correctly, the > >> point is to bump up a usage counter on the catcache entry. You could > >> increment the counter unconditionally, even if the feature is not > >> used, and avoid the branch that way. > > That change causes 4.9% degradation, which is worse than having a > > branch. > > master 8364.54ms, patched 8666.86ms (+4.9%) > > The additional instructions follow. > > + 8721ab <+203>: mov 0x30(%rbx),%eax # %eax = ct->naccess > > + 8721ae <+206>: mov $0x2,%edx > > + 8721b3 <+211>: add $0x1,%eax # %eax++ > > + 8721b6 <+214>: cmove %edx,%eax # if %eax == 0 then %eax = 2 > > <original code> > > + 8721bf <+223>: mov %eax,0x30(%rbx) # ct->naccess = %eax > > + 8721c2 <+226>: mov 0x4cfe9f(%rip),%rax # 0xd42068 <catcacheclock> > > + 8721c9 <+233>: mov %rax,0x38(%rbx) # ct->lastaccess = %rax > > Do you need the "ntaccess == 2" test? You could always increment the > counter, and in the code that uses ntaccess to decide what to evict, > treat all values >= 2 the same. > > Need to handle integer overflow somehow. Or maybe not: integer > overflow is so infrequent that even if a hot syscache entry gets > evicted prematurely because its ntaccess count wrapped around to 0, it > will happen so rarely that it won't make any difference in practice. Agreed. Ok, I have prioritized completely avoiding degradation on the normal path, but relaxing that restriction to 1% or so makes the code far simpler and makes the expiration path significantly faster. Now the branch for the counter increment is removed. For the similar branches on the counter-decrement side in CatCacheCleanupOldEntries(), Min() is compiled into cmovbe and a branch was removed.
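The counter-decrement logic referred to above (as it appears in the v5 patch's CatCacheCleanupOldEntries()) can be extracted into a standalone helper for illustration. The Min() macro is reproduced in the style of PostgreSQL's c.h; the helper itself is an extraction for this sketch, not a function in the patch:

```c
#include <assert.h>

#define Min(x, y)	((x) < (y) ? (x) : (y))

/*
 * Decay step applied to an entry's access counter during pruning,
 * mirroring the patch's:
 *     ct->naccess = Min(2, ct->naccess);
 *     if (--ct->naccess == 0)
 * Callers only reach this with naccess >= 1 (entries start at 1 and
 * are removed once the counter hits 0).  A return value of 0 marks the
 * entry as a removal candidate; all values above 2 are treated as 2,
 * so even a hot entry survives only a bounded number of idle pruning
 * rounds.  Written as Min(), the clamp lets the compiler emit a cmov
 * (cmovbe in the disassembly) instead of a branch.
 */
static unsigned int
decay_naccess(unsigned int naccess)
{
	naccess = Min(2, naccess);	/* branch-free clamp */
	return naccess - 1;
}
```
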
At Mon, 09 Nov 2020 11:13:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > Now the branch for the counter increment is removed. For the similar > branches on the counter-decrement side in CatCacheCleanupOldEntries(), > Min() is compiled into cmovbe and a branch was removed. Mmm. Sorry, I sent this by mistake. Please ignore it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > Do you need the "ntaccess == 2" test? You could always increment the > counter, and in the code that uses ntaccess to decide what to evict, > treat all values >= 2 the same. > > Need to handle integer overflow somehow. Or maybe not: integer > overflow is so infrequent that even if a hot syscache entry gets > evicted prematurely because its ntaccess count wrapped around to 0, it > will happen so rarely that it won't make any difference in practice. That relaxation simplifies the code significantly, but a degradation of about 5% still exists. (SearchCatCacheInternal()) + ct->naccess++; !+ ct->lastaccess = catcacheclock; If I remove the second line above, the degradation disappears (-0.7%). However, I don't find the corresponding numbers in the output of perf. The sum of the numbers for the removed instructions is (0.02 + 0.28 = 0.3%). The degradation as a whole doesn't always show up in instruction-level profiling, but I'm stuck here, anyway. % samples master p2 patched (p2 = patched - "ct->lastaccess = catcacheclock) ============================================================================= 0.47 | 0.27 | 0.17 | mov %rbx,0x8(%rbp) | | | SearchCatCacheInternal(): | | | ct->naccess++; | | | ct->lastaccess = catcacheclock; ----- |----- | 0.02 |10f: mov catcacheclock,%rax | | | ct->naccess++; ----- | 0.96 | 1.00 | addl $0x1,0x14(%rbx) | | | return NULL; ----- | 0.11 | 0.16 | xor %ebp,%ebp | | | if (!ct->negative) 0.27 | 0.30 | 0.03 | cmpb $0x0,0x21(%rbx) | | | ct->lastaccess = catcacheclock; ----- | ---- | 0.28 | mov %rax,0x18(%rbx) | | | if (!ct->negative) 0.34 | 0.08 | 0.59 | ↓ jne 149 For your information, the same table for a bit wider range follows.
% samples master p2 patched (p2 = patched - "ct->lastaccess = catcacheclock) ============================================================================= | | | dlist_foreach(iter, bucket) 6.91 | 7.06 | 5.89 | mov 0x8(%rbp),%rbx 0.78 | 0.73 | 0.81 | test %rbx,%rbx | | | ↓ je 160 | | | cmp %rbx,%rbp 0.46 | 0.52 | 0.39 | ↓ jne 9d | | | ↓ jmpq 160 | | | nop 5.68 | 5.54 | 6.03 | 90: mov 0x8(%rbx),%rbx 1.44 | 1.42 | 1.43 | cmp %rbx,%rbp | | | ↓ je 160 | | | { | | | ct = dlist_container(CatCTup, cache_elem, iter.cur); | | | | | | if (ct->dead) 30.36 |30.97 | 31.48 | 9d: cmpb $0x0,0x20(%rbx) 2.63 | 2.60 | 2.69 | ↑ jne 90 | | | continue; /* ignore dead entries */ | | | | | | if (ct->hash_value != hashValue) 1.41 | 1.37 | 1.35 | cmp -0x24(%rbx),%edx 3.19 | 2.97 | 2.87 | ↑ jne 90 7.17 | 5.53 | 6.89 | mov %r13,%rsi 0.02 | 0.04 | 0.04 | xor %r12d,%r12d 3.00 | 2.98 | 2.95 | ↓ jmp b5 0.15 | 0.61 | 0.20 | b0: mov 0x10(%rsp,%r12,1),%rsi 6.58 | 5.04 | 5.95 | b5: mov %ecx,0xc(%rsp) | | | CatalogCacheCompareTuple(): | | | if (!(cc_fastequal[i]) (cachekeys[i], searchkeys[i])) 1.51 | 0.92 | 1.66 | mov -0x20(%rbx,%r12,1),%rdi 0.54 | 1.64 | 0.58 | mov %edx,0x8(%rsp) 3.78 | 3.11 | 3.86 | → callq *0x38(%r14,%r12,1) 0.43 | 2.30 | 0.34 | mov 0x8(%rsp),%edx 0.20 | 0.94 | 0.25 | mov 0xc(%rsp),%ecx 0.44 | 0.41 | 0.44 | test %al,%al | | | ↑ je 90 | | | for (i = 0; i < nkeys; i++) 2.28 | 1.07 | 2.26 | add $0x8,%r12 0.08 | 0.23 | 0.07 | cmp $0x18,%r12 0.11 | 0.64 | 0.10 | ↑ jne b0 | | | dlist_move_head(): | | | */ | | | static inline void | | | dlist_move_head(dlist_head *head, dlist_node *node) | | | { | | | /* fast path if it's already at the head */ | | | if (head->head.next == node) 0.08 | 0.61 | 0.04 | cmp 0x8(%rbp),%rbx 0.02 | 0.10 | 0.00 | ↓ je 10f | | | return; | | | | | | dlist_delete(node); 0.01 | 0.20 | 0.06 | mov 0x8(%rbx),%rax | | | dlist_delete(): | | | node->prev->next = node->next; 0.75 | 0.13 | 0.72 | mov (%rbx),%rdx 2.89 | 3.42 | 2.22 | mov %rax,0x8(%rdx) | | | node->next->prev = 
node->prev; 0.01 | 0.09 | 0.00 | mov (%rbx),%rdx 0.04 | 0.62 | 0.58 | mov %rdx,(%rax) | | | dlist_push_head(): | | | if (head->head.next == NULL) /* convert NULL header to circular */ 0.31 | 0.08 | 0.28 | mov 0x8(%rbp),%rax 0.55 | 0.44 | 0.28 | test %rax,%rax | | | ↓ je 180 | | | node->next = head->head.next; 0.00 | 0.08 | 0.06 |101: mov %rax,0x8(%rbx) | | | node->prev = &head->head; 0.17 | 0.73 | 0.37 | mov %rbp,(%rbx) | | | node->next->prev = node; 0.34 | 0.08 | 1.13 | mov %rbx,(%rax) | | | head->head.next = node; 0.47 | 0.27 | 0.17 | mov %rbx,0x8(%rbp) | | | SearchCatCacheInternal(): | | | ct->naccess++; | | | ct->lastaccess = catcacheclock; ----- |----- | 0.02 |10f: mov catcacheclock,%rax | | | ct->naccess++; ----- | 0.96 | 1.00 | addl $0x1,0x14(%rbx) | | | return NULL; ----- | 0.11 | 0.16 | xor %ebp,%ebp | | | if (!ct->negative) 0.27 | 0.30 | 0.03 | cmpb $0x0,0x21(%rbx) | | | ct->lastaccess = catcacheclock; ----- | ---- | 0.28 | mov %rax,0x18(%rbx) | | | if (!ct->negative) 0.34 | 0.08 | 0.59 | ↓ jne 149 -- Kyotaro Horiguchi NTT Open Source Software Center From 498a55ff07f19646ca09034dfdc4c68459a74855 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Fri, 6 Nov 2020 17:27:18 +0900 Subject: [PATCH v5] CatCache expiration feature --- src/backend/access/transam/xact.c | 3 + src/backend/utils/cache/catcache.c | 118 +++++++++++++++++++++++++++++ src/backend/utils/misc/guc.c | 12 +++ src/include/utils/catcache.h | 20 +++++ 4 files changed, 153 insertions(+) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index af6afcebb1..a246fcc4c0 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1086,6 +1086,9 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 3613ae5f44..b457fed7ab 100644 --- 
a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,18 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. */ +TimestampTz catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4); +static bool CatCacheCleanupOldEntries(CatCache *cp); static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); @@ -99,6 +110,12 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} /* * internal support functions @@ -863,6 +880,10 @@ RehashCatCache(CatCache *cp) int newnbuckets; int i; + /* try removing old entries before expanding hash */ + if (CatCacheCleanupOldEntries(cp)) + return; + elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets", cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets); @@ -1264,6 +1285,16 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* + * Prolong life of this entry. 
Since we want to run as few instructions + * as possible and want the branch to be stable for performance reasons, + * we don't worry about wrap-around and possible false negatives for old + * entries. The window is quite narrow and the counter doesn't get so + * large while expiration is active. + */ + ct->naccess++; + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1425,6 +1456,91 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent the catcache from bloating. It is + * based on an algorithm similar to buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. + */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + long age; + int us; + + /* Return immediately if disabled */ + if (likely(catalog_cache_prune_min_age < 0)) + return false; + + /* Don't scan the hash when we know we don't have prunable entries */ + TimestampDifference(cp->cc_oldest_ts, catcacheclock, &age, &us); + if (age < catalog_cache_prune_min_age) + return false; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + /* + * Calculate the duration from the last access + * to the "current" time.
catcacheclock is updated on a + per-statement basis and additionally updated periodically + during a long running query. + */ + TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); + + if (age > catalog_cache_prune_min_age) + { + /* + * Entries that are not accessed after the last pruning + * are removed after that many seconds, and their lives are + * prolonged according to how many times they are accessed, + * up to three times the duration. We don't try to shrink + * buckets since pruning effectively caps catcache + * expansion in the long term. + */ + ct->naccess = Min(2, ct->naccess); + if (--ct->naccess == 0) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't update oldest_ts by removed entry */ + continue; + } + } + } + + /* update oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * ReleaseCatCache * @@ -1888,6 +2004,8 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->naccess = 1; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index bb34630e8e..95213853aa 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -88,6 +88,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] = check_huge_page_size, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache
entries that have been unused for more than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + -1, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f4aa316604..a11736f767 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + TimestampTz cc_oldest_ts; /* timestamp of the oldest tuple in the hash */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,8 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? */ HeapTupleData tuple; /* tuple management header */ + unsigned int naccess; /* # of accesses to this entry */ + TimestampTz lastaccess; /* timestamp of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +193,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... 
*/ extern PGDLLIMPORT MemoryContext CacheMemoryContext; + +/* for guc.c, not PGDLLIMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern TimestampTz catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = ts; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.18.4
On 09/11/2020 11:34, Kyotaro Horiguchi wrote: > At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in >> Do you need the "ntaccess == 2" test? You could always increment the >> counter, and in the code that uses ntaccess to decide what to evict, >> treat all values >= 2 the same. >> >> Need to handle integer overflow somehow. Or maybe not: integer >> overflow is so infrequent that even if a hot syscache entry gets >> evicted prematurely because its ntaccess count wrapped around to 0, it >> will happen so rarely that it won't make any difference in practice. > > That relaxing simplifies the code significantly, but a significant > degradation by about 5% still exists. > > (SearchCatCacheInternal()) > + ct->naccess++; > !+ ct->lastaccess = catcacheclock; > > If I removed the second line above, the degradation disappears > (-0.7%). 0.7% degradation is probably acceptable. > However, I don't find the corresponding numbers in the output > of perf. The sum of the numbers for the removed instructions is (0.02 > + 0.28 = 0.3%). I don't think the degradation as the whole doesn't > always reflect to the instruction level profiling, but I'm stuck here, > anyway. Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple) is exactly 64 bytes currently, so any fields that you add after 'tuple' will go on a different cache line. Maybe it would help if you just move the new fields before 'tuple'. Making CatCTup smaller might help. Some ideas/observations: - The 'ct_magic' field is only used for assertion checks. Could remove it. - 4 Datums (32 bytes) are allocated for the keys, even though most catcaches have fewer key columns. - In the current syscaches, keys[2] and keys[3] are only used to store 32-bit oids or some other smaller fields. Allocating a full 64-bit Datum for them wastes memory. 
- You could move the dead flag to the end of the struct, or remove it altogether with the change I mentioned earlier to not keep dead items in the buckets. - You could steal a few bits for dead/negative flags from some other field. Use special values for tuple.t_len for them or something. With some of these tricks, you could shrink CatCTup so that the new lastaccess and naccess fields would fit in the same cacheline. That said, I think this is good enough performance-wise as it is. So if we want to improve performance in general, that can be a separate patch. - Heikki
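The cache-line argument above can be illustrated with a small standalone C sketch. Everything here is hypothetical (the real CatCTup layout differs); the only assumption carried over from the discussion is a 64-byte cache line and a 64-byte prefix before `tuple`, so a field appended after the prefix lands on a second line, while a field placed earlier can share the first one:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Hypothetical stand-in for the entry fields preceding 'tuple'. */
typedef struct MockHeader { char pad[56]; } MockHeader;

/* New field appended after the 64-byte prefix: it starts at byte 64,
 * i.e. on the second cache line. */
typedef struct EntryAfter {
    MockHeader hdr;
    uint64_t   tuple_hdr;   /* stands in for 'tuple'; completes 64 bytes */
    uint64_t   lastaccess;  /* offset 64: second cache line */
} EntryAfter;

/* Same field moved before 'tuple', after shrinking the header by 8 bytes
 * (the point of the struct-shrinking tricks): first cache line. */
typedef struct EntryBefore {
    char       pad[48];     /* header shrunk to make room */
    uint64_t   lastaccess;  /* offset 48: first cache line */
    uint64_t   tuple_hdr;
} EntryBefore;

/* Which 64-byte cache line a given member offset falls on. */
static size_t cache_line_of(size_t off) { return off / CACHE_LINE; }
```

On a typical LP64 ABI the two `lastaccess` fields end up on different cache lines, which is all the hot-path argument needs.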
On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > 0.7% degradation is probably acceptable. I haven't looked at this patch in a while and I'm pleased with the way it seems to have been redesigned. It seems relatively simple and unlikely to cause big headaches. I would say that 0.7% is probably not acceptable on a general workload, but it seems fine on a benchmark that is specifically designed to be a worst-case for this patch, which I gather is what's happening here. I think it would be nice if we could enable this feature by default. Does it cause a measurable regression on realistic workloads when enabled? I bet a default of 5 or 10 minutes would help many users. One idea for improving things might be to move the "return immediately" tests in CatCacheCleanupOldEntries() to the caller, and only call this function if they indicate that there is some purpose. This would avoid the function call overhead when nothing can be done. Perhaps the two tests could be combined into one and simplified. Like, suppose the code looks (roughly) like this: if (catcacheclock >= time_at_which_we_can_prune) CatCacheCleanupOldEntries(...); To make it that simple, we want catcacheclock and time_at_which_we_can_prune to be stored as bare uint64 quantities so we don't need TimestampDifference(). And we want time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature is disabled. But those both seem like pretty achievable things... and it seems like the result would probably be faster than what you have now. + * per-statement basis and additionaly udpated periodically two words spelled wrong +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + catalog_cache_prune_min_age = newval; +} hmm, do we need this? + /* + * Entries that are not accessed after the last pruning + * are removed in that seconds, and their lives are + * prolonged according to how many times they are accessed + * up to three times of the duration. 
We don't try shrink + * buckets since pruning effectively caps catcache + * expansion in the long term. + */ + ct->naccess = Min(2, ct->naccess); The code doesn't match the comment, it seems, because the limit here is 2, not 3. I wonder if this does anything anyway. My intuition is that when a catcache entry gets accessed at all it's probably likely to get accessed a bunch of times. If there are any meaningful thresholds here I'd expect us to be trying to distinguish things like 1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't need to distinguish at all and can just have a single mark bit rather than a counter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
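The "single comparison" idea in the pseudo-code above can be sketched in isolation as follows. The names mirror Robert's pseudo-code but all of this is illustrative, not actual PostgreSQL code: the clock is a bare uint64 microsecond counter, and disabling the feature parks the threshold at UINT64_MAX so the branch is never taken:

```c
#include <stdint.h>

/* Illustrative globals: a bare microsecond clock and a precomputed
 * prune deadline.  UINT64_MAX means "feature disabled". */
static uint64_t catcacheclock = 0;
static uint64_t time_at_which_we_can_prune = UINT64_MAX;

static int prune_calls = 0;     /* counts how often pruning actually runs */

static void CatCacheCleanupOldEntries_stub(void)
{
    prune_calls++;
}

/* The hot-path gate: one unsigned comparison, and no function call at
 * all when the feature is disabled. */
static void maybe_prune(void)
{
    if (catcacheclock >= time_at_which_we_can_prune)
        CatCacheCleanupOldEntries_stub();
}
```

With the threshold pinned at UINT64_MAX the comparison can never be true, so neither TimestampDifference() nor the function-call overhead is paid on the common path.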
At Tue, 17 Nov 2020 17:46:25 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 09/11/2020 11:34, Kyotaro Horiguchi wrote: > > At Fri, 6 Nov 2020 10:42:15 +0200, Heikki Linnakangas <hlinnaka@iki.fi> > > wrote in > >> Do you need the "ntaccess == 2" test? You could always increment the > >> counter, and in the code that uses ntaccess to decide what to evict, > >> treat all values >= 2 the same. > >> > >> Need to handle integer overflow somehow. Or maybe not: integer > >> overflow is so infrequent that even if a hot syscache entry gets > >> evicted prematurely because its ntaccess count wrapped around to 0, it > >> will happen so rarely that it won't make any difference in practice. > > That relaxing simplifies the code significantly, but a significant > > degradation by about 5% still exists. > > (SearchCatCacheInternal()) > > + ct->naccess++; > > !+ ct->lastaccess = catcacheclock; > > If I removed the second line above, the degradation disappears > > (-0.7%). > > 0.7% degradation is probably acceptable. Sorry for the confusion "-0.7% degradation" meant "+0.7% gain". > > However, I don't find the corresponding numbers in the output > > of perf. The sum of the numbers for the removed instructions is (0.02 > > + 0.28 = 0.3%). I don't think the degradation as the whole doesn't > > always reflect to the instruction level profiling, but I'm stuck here, > > anyway. > > Hmm. Some kind of cache miss effect, perhaps? offsetof(CatCTup, tuple) is Shouldn't it be seen in the perf result? > exactly 64 bytes currently, so any fields that you add after 'tuple' will go > on a different cache line. Maybe it would help if you just move the new fields > before 'tuple'. > > Making CatCTup smaller might help. Some ideas/observations: > > - The 'ct_magic' field is only used for assertion checks. Could remove it. Ok, removed. > - 4 Datums (32 bytes) are allocated for the keys, even though most catcaches > - have fewer key columns. 
> - In the current syscaches, keys[2] and keys[3] are only used to store 32-bit > oids or some other smaller fields. Allocating a full 64-bit Datum for them > wastes memory. That seems like a last resort. > - You could move the dead flag at the end of the struct or remove it > altogether, with the change I mentioned earlier to not keep dead items in > the buckets This seems most promising, so I did this. One annoyance is that we need to know whether a catcache tuple has been invalidated or not to judge whether to remove it. I used CatCTup.cache_elem.prev to signal that in the next version. > - You could steal a few bit for dead/negative flags from some other field. Use > special values for tuple.t_len for them or something. I stole the MSB of refcount as the negative flag, but the bit-masking operations seem to make the function slower. Benchmark 2 gets slower by around 2% in total.
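The cache_elem.prev trick described above can be sketched on its own with a toy doubly linked list (illustrative names, not the real dlist API). With a circular head sentinel, a live bucket member always has a non-NULL prev pointer, so prev == NULL can double as the "removed/dead" marker and the separate bool flag disappears:

```c
#include <stddef.h>

/* Toy intrusive list node and entry, standing in for dlist_node/CatCTup. */
typedef struct dnode { struct dnode *prev, *next; } dnode;

typedef struct entry {
    dnode cache_elem;   /* bucket-list membership */
    int   refcount;
} entry;

/* Circular empty list: the head points at itself. */
static void list_init(dnode *head)
{
    head->prev = head->next = head;
}

static void push_head(dnode *head, entry *e)
{
    e->cache_elem.prev = head;
    e->cache_elem.next = head->next;
    head->next->prev = &e->cache_elem;
    head->next = &e->cache_elem;
}

/* Delink and mark dead in one step: there is no separate flag to keep
 * in sync with list membership. */
static void delink(entry *e)
{
    dnode *n = &e->cache_elem;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = NULL;     /* doubles as the "dead" marker */
}

static int is_dead(const entry *e)
{
    return e->cache_elem.prev == NULL;
}
```

The one invariant this relies on is exactly the annoyance mentioned above: every live entry must be on some bucket list, so prev is never NULL while the entry is alive.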
Thank you for the comments. At Tue, 17 Nov 2020 16:22:54 -0500, Robert Haas <robertmhaas@gmail.com> wrote in > On Tue, Nov 17, 2020 at 10:46 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > 0.7% degradation is probably acceptable. > > I haven't looked at this patch in a while and I'm pleased with the way > it seems to have been redesigned. It seems relatively simple and > unlikely to cause big headaches. I would say that 0.7% is probably not > acceptable on a general workload, but it seems fine on a benchmark Sorry for the confusing notation; "-0.7% degradation" meant a +0.7% *gain*, which I think was an error. However, the next patch makes catcache apparently *faster*, so the difference doesn't matter.. > that is specifically designed to be a worst-case for this patch, which > I gather is what's happening here. I think it would be nice if we > could enable this feature by default. Does it cause a measurable > regression on realistic workloads when enabled? I bet a default of 5 > or 10 minutes would help many users. > > One idea for improving things might be to move the "return > immediately" tests in CatCacheCleanupOldEntries() to the caller, and > only call this function if they indicate that there is some purpose. > This would avoid the function call overhead when nothing can be done. > Perhaps the two tests could be combined into one and simplified. Like, > suppose the code looks (roughly) like this: > > if (catcacheclock >= time_at_which_we_can_prune) > CatCacheCleanupOldEntries(...); The compiler removes the call (or inlines the function), but of course we can write it that way, and it shows the condition for calling the function more clearly. The codelet above, however, doesn't consider the result of CatCacheCleanupOldEntries() itself. The function returns false when all "old" entries have been invalidated or explicitly removed, and we need to expand the hash in that case. 
> To make it that simple, we want catcacheclock and > time_at_which_we_can_prune to be stored as bare uint64 quantities so > we don't need TimestampDifference(). And we want > time_at_which_we_can_prune to be set to PG_UINT64_MAX when the feature > is disabled. But those both seem like pretty achievable things... and > it seems like the result would probably be faster than what you have > now. The time_at_which_we_can_prune is not global but catcache-local, and needs to change whenever catalog_cache_prune_min_age is changed. So the next version does the following: - if (CatCacheCleanupOldEntries(cp)) + if (catcacheclock - cp->cc_oldest_ts > prune_min_age_us && + CatCacheCleanupOldEntries(cp)) On the other hand, CatCacheCleanupOldEntries can calculate the time_at_which_we_can_prune once at the beginning of the function. That makes the condition in the loop simpler. - TimestampDifference(ct->lastaccess, catcacheclock, &age, &us); - - if (age > catalog_cache_prune_min_age) + if (ct->lastaccess < prune_threshold) { > + * per-statement basis and additionaly udpated periodically > > two words spelled wrong Ugg. Fixed. Checked all spellings and found another misspelling. > +void > +assign_catalog_cache_prune_min_age(int newval, void *extra) > +{ > + catalog_cache_prune_min_age = newval; > +} > > hmm, do we need this? *That* assignment is actually useless, but the function is kept, and now it maintains the internal version of the GUC parameter (uint64 prune_min_age). > + /* > + * Entries that are not accessed after the last pruning > + * are removed in that seconds, and their lives are > + * prolonged according to how many times they are accessed > + * up to three times of the duration. We don't try shrink > + * buckets since pruning effectively caps catcache > + * expansion in the long term. > + */ > + ct->naccess = Min(2, ct->naccess); > > The code doesn't match the comment, it seems, because the limit here > is 2, not 3. I wonder if this does anything anyway. 
My intuition is > that when a catcache entry gets accessed at all it's probably likely > to get accessed a bunch of times. If there are any meaningful > thresholds here I'd expect us to be trying to distinguish things like > 1000+ accesses vs. 100-1000 vs. 10-100 vs. 1-10. Or maybe we don't > need to distinguish at all and can just have a single mark bit rather > than a counter. Agreed. Since I don't see a clear criterion for the threshold of the counter, I removed naccess and the related lines. I made the following changes in the attached. 1. Removed naccess and related lines. 2. Moved the precheck condition out of CatCacheCleanupOldEntries() to RehashCatCache(). 3. Use uint64 direct comparison instead of TimestampDifference(). 4. Removed the CatCTup.dead flag. Performance measurements on the attached version showed better results for searching, but somewhat worse ones for cache entry creation. Each time is the mean of 10 runs. # Catcache (negative) entry creation : time(ms) (% to master) master : 3965.61 (100.0) patched-off: 4040.93 (101.9) patched-on : 4032.22 (101.7) # Searching negative cache entries master : 8173.46 (100.0) patched-off: 7983.43 ( 97.7) patched-on : 8049.88 ( 98.5) # Creation, searching and expiration master : 6393.23 (100.0) patched-off: 6527.94 (102.1) patched-on : 15880.01 (248.4) That is, catcache searching gets faster by 2-3% but creation gets slower by about 2%. If I moved the precheck condition of change 2 further up, into CatalogCacheCreateEntry(), that degradation was reduced to 0.6%. # Catcache (negative) entry creation master : 3967.45 (100.0) patched-off : 3990.43 (100.6) patched-on : 4108.96 (103.6) # Searching negative cache entries master : 8106.53 (100.0) patched-off : 8036.61 ( 99.1) patched-on : 8058.18 ( 99.4) # Creation, searching and expiration master : 6395.00 (100.0) patched-off : 6416.57 (100.3) patched-on : 15830.91 (247.6) The degradation doesn't get smaller even if I revert the changed lines in CatalogCacheCreateEntry().. regards. 
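Change 3 above (uint64 direct comparison instead of TimestampDifference()) depends on the two forms being equivalent. A minimal sketch, under the assumptions that both timestamps are microseconds since a common epoch and the clock never precedes lastaccess or the threshold (names here are illustrative):

```c
#include <stdint.h>

#define USECS_PER_SEC 1000000ULL

/* TimestampDifference()-style check: compute the age, then compare. */
static int prunable_slow(uint64_t lastaccess, uint64_t clock, unsigned min_age_s)
{
    uint64_t age_us = clock - lastaccess;
    return age_us > (uint64_t) min_age_s * USECS_PER_SEC;
}

/* Direct-comparison form: the threshold is computed once per scan and
 * each entry costs a single unsigned compare. */
static int prunable_fast(uint64_t lastaccess, uint64_t clock, unsigned min_age_s)
{
    uint64_t threshold = clock - (uint64_t) min_age_s * USECS_PER_SEC;
    return lastaccess < threshold;
}
```

Both forms are strict comparisons, so `clock - lastaccess > min_age` and `lastaccess < clock - min_age` agree on the boundary as well, as long as the subtraction cannot wrap.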
-- Kyotaro Horiguchi NTT Open Source Software Center From 990514a853ad92b2d929cc026724194831ef8793 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:54:31 +0900 Subject: [PATCH v5 1/3] CatCache expiration feature --- src/backend/access/transam/xact.c | 3 ++ src/backend/utils/cache/catcache.c | 87 +++++++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++++ src/include/utils/catcache.h | 19 +++++++ 4 files changed, 120 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 03c553e7ea..4a2a90ce0c 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1086,6 +1086,9 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 3613ae5f44..1ebcc7dcd3 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,19 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; +uint64 prune_min_age_us; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +uint64 catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4); +static bool CatCacheCleanupOldEntries(CatCache *cp); static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); @@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (newval < 0) + prune_min_age_us = UINT64_MAX; + else + prune_min_age_us = ((uint64) newval) * USECS_PER_SEC; +} /* * internal support functions @@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Record the last access timestamp */ + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. It is based + * on the similar algorithm with buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. 
+ */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + uint64 prune_threshold = catcacheclock - prune_min_age_us; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + if (ct->lastaccess < prune_threshold) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't let the removed entry update oldest_ts */ + continue; + } + } + + /* update the oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * ReleaseCatCache * @@ -1888,6 +1967,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1899,7 +1979,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * arbitrarily, we enlarge when fill factor > 2. 
*/ if (cache->cc_ntup > cache->cc_nbuckets * 2) - RehashCatCache(cache); + { + /* try removing old entries before expanding hash */ + if (catcacheclock - cache->cc_oldest_ts < prune_min_age_us || + !CatCacheCleanupOldEntries(cache)) + RehashCatCache(cache); + } return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index bb34630e8e..95213853aa 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -88,6 +88,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -3399,6 +3400,17 @@ static struct config_int ConfigureNamesInt[] = check_huge_page_size, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that have been unused for more than this many seconds are considered for removal."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + -1, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index f4aa316604..81587c3fe6 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + uint64 cc_oldest_ts; /* timestamp (us) of the oldest tuple */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,7 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? 
*/ HeapTupleData tuple; /* tuple management header */ + uint64 lastaccess; /* timestamp in us of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; + +/* for guc.c, not PGDLLPMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern uint64 catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = (uint64) ts; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.18.4 From 2bc7eb221768ee8484fb65db48fa16f6e2c4b347 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:57:05 +0900 Subject: [PATCH v5 2/3] Remove "dead" flag from catcache tuple --- src/backend/utils/cache/catcache.c | 43 +++++++++++++----------------- src/include/utils/catcache.h | 10 ------- 2 files changed, 18 insertions(+), 35 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 1ebcc7dcd3..3e6c4720dc 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -480,6 +480,13 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) Assert(ct->refcount == 0); Assert(ct->my_cache == cache); + /* delink from linked list if not yet */ + if (ct->cache_elem.prev) + { + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; + } + if (ct->c_list) { /* @@ -487,14 +494,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) * which will recurse back to me, and the recursive call will do the * work. Set the "dead" flag to make sure it does recurse. 
*/ - ct->dead = true; CatCacheRemoveCList(cache, ct->c_list); return; /* nothing left to do */ } - /* delink from linked list */ - dlist_delete(&ct->cache_elem); - /* * Free keys when we're dealing with a negative entry, normal entries just * point into tuple, allocated together with the CatCTup. @@ -534,7 +537,7 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl) /* if the member is dead and now has no references, remove it */ if ( #ifndef CATCACHE_FORCE_RELEASE - ct->dead && + ct->cache_elem.prev == NULL && #endif ct->refcount == 0) CatCacheRemoveCTup(cache, ct); @@ -609,7 +612,9 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) if (ct->refcount > 0 || (ct->c_list && ct->c_list->refcount > 0)) { - ct->dead = true; + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; + /* list, if any, was marked dead above */ Assert(ct->c_list == NULL || ct->c_list->dead); } @@ -688,7 +693,8 @@ ResetCatalogCache(CatCache *cache) if (ct->refcount > 0 || (ct->c_list && ct->c_list->refcount > 0)) { - ct->dead = true; + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; /* list, if any, was marked dead above */ Assert(ct->c_list == NULL || ct->c_list->dead); } @@ -1268,9 +1274,6 @@ SearchCatCacheInternal(CatCache *cache, { ct = dlist_container(CatCTup, cache_elem, iter.cur); - if (ct->dead) - continue; /* ignore dead entries */ - if (ct->hash_value != hashValue) continue; /* quickly skip entry if wrong hash val */ @@ -1522,7 +1525,6 @@ ReleaseCatCache(HeapTuple tuple) offsetof(CatCTup, tuple)); /* Safety checks to ensure we were handed a cache entry */ - Assert(ct->ct_magic == CT_MAGIC); Assert(ct->refcount > 0); ct->refcount--; @@ -1530,7 +1532,7 @@ ReleaseCatCache(HeapTuple tuple) if ( #ifndef CATCACHE_FORCE_RELEASE - ct->dead && + ct->cache_elem.prev == NULL && #endif ct->refcount == 0 && (ct->c_list == NULL || ct->c_list->refcount == 0)) @@ -1737,8 +1739,8 @@ SearchCatCacheList(CatCache *cache, { ct = dlist_container(CatCTup, cache_elem, iter.cur); - 
if (ct->dead || ct->negative) - continue; /* ignore dead and negative entries */ + if (ct->negative) + continue; /* ignore negative entries */ if (ct->hash_value != hashValue) continue; /* quickly skip entry if wrong hash val */ @@ -1799,14 +1801,13 @@ SearchCatCacheList(CatCache *cache, { foreach(ctlist_item, ctlist) { + Assert (ct->cache_elem.prev != NULL); + ct = (CatCTup *) lfirst(ctlist_item); Assert(ct->c_list == NULL); Assert(ct->refcount > 0); ct->refcount--; if ( -#ifndef CATCACHE_FORCE_RELEASE - ct->dead && -#endif ct->refcount == 0 && (ct->c_list == NULL || ct->c_list->refcount == 0)) CatCacheRemoveCTup(cache, ct); @@ -1834,9 +1835,6 @@ SearchCatCacheList(CatCache *cache, /* release the temporary refcount on the member */ Assert(ct->refcount > 0); ct->refcount--; - /* mark list dead if any members already dead */ - if (ct->dead) - cl->dead = true; } Assert(i == nmembers); @@ -1960,11 +1958,9 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * Finish initializing the CatCTup header, and add it to the cache's * linked list and counts. 
*/ - ct->ct_magic = CT_MAGIC; ct->my_cache = cache; ct->c_list = NULL; ct->refcount = 0; /* for the moment */ - ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; ct->lastaccess = catcacheclock; @@ -2158,9 +2154,6 @@ PrintCatCacheLeakWarning(HeapTuple tuple) CatCTup *ct = (CatCTup *) (((char *) tuple) - offsetof(CatCTup, tuple)); - /* Safety check to ensure we were handed a cache entry */ - Assert(ct->ct_magic == CT_MAGIC); - elog(WARNING, "cache reference leak: cache %s (%d), tuple %u/%u has count %d", ct->my_cache->cc_relname, ct->my_cache->id, ItemPointerGetBlockNumber(&(tuple->t_self)), diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 81587c3fe6..36940f4e3b 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -87,9 +87,6 @@ typedef struct catcache typedef struct catctup { - int ct_magic; /* for identifying CatCTup entries */ -#define CT_MAGIC 0x57261502 - uint32 hash_value; /* hash value for this tuple's keys */ /* @@ -106,19 +103,12 @@ typedef struct catctup dlist_node cache_elem; /* list member of per-bucket list */ /* - * A tuple marked "dead" must not be returned by subsequent searches. - * However, it won't be physically deleted from the cache until its - * refcount goes to zero. (If it's a member of a CatCList, the list's - * refcount must go to zero, too; also, remember to mark the list dead at - * the same time the tuple is marked.) - * * A negative cache entry is an assertion that there is no tuple matching * a particular key. This is just as useful as a normal entry so far as * avoiding catalog searches is concerned. Management of positive and * negative entries is identical. */ int refcount; /* number of active references */ - bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? 
*/ HeapTupleData tuple; /* tuple management header */ uint64 lastaccess; /* timestamp in us of the last usage */ -- 2.18.4 From 61806132ea82e90997e38a6cb0fa0d74bc3c4c2b Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:56:41 +0900 Subject: [PATCH v5 3/3] catcachebench --- contrib/catcachebench/Makefile | 17 + contrib/catcachebench/catcachebench--0.0.sql | 14 + contrib/catcachebench/catcachebench.c | 330 +++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 33 ++ src/backend/utils/cache/syscache.c | 2 +- 6 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..ea9cd62abb --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,14 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. 
\quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..b5a4d794ed --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,330 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + +Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + 
state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. */ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entires of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of a table 6000 times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of all tables twice with having expiration + * happen. + */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 1000 tables scan. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 3e6c4720dc..294d906416 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -767,6 +767,39 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + elog(LOG, "Catcache reset"); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index 809b27a038..e83b3f66d1 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -982,7 +982,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false; -- 2.18.4
Hi,

On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> # Creation, searching and expiration
> master     : 6393.23 (100.0)
> patched-off: 6527.94 (102.1)
> patched-on : 15880.01 (248.4)

What's the deal with this massive increase here?

Greetings,

Andres Freund
At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > # Creation, searching and expiration
> > master     : 6393.23 (100.0)
> > patched-off: 6527.94 (102.1)
> > patched-on : 15880.01 (248.4)
>
> What's the deal with this massive increase here?

CatCacheRemoveCTup(). If I replace the call to that function in the
cleanup function with dlist_delete(), the results change as follows:

master      : 6372.04 (100.0) (2)
patched-off : 6464.97 (101.5) (2)
patched-on  : 5354.42 ( 84.0) (2)

We could speed up expiration if we reuse the "deleted" entry at the
next entry creation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
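The entry-reuse idea floated above could look something like this minimal sketch: a free list of recycled entries that is consulted before allocating a new one. The `Entry` struct and the `entry_alloc`/`entry_release` names are illustrative stand-ins, not the actual CatCTup code (real catcache entries also embed variable-size tuples, so a real implementation would have to account for size classes):

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative only: a trimmed stand-in for a catcache entry. */
typedef struct Entry
{
    struct Entry *next_free;    /* link while sitting on the free list */
    unsigned      hash_value;
    int           refcount;
} Entry;

static Entry *freelist = NULL;

/* On removal, push the entry onto the free list instead of freeing it. */
static void
entry_release(Entry *e)
{
    e->next_free = freelist;
    freelist = e;
}

/* On creation, pop a recycled entry if one is available. */
static Entry *
entry_alloc(unsigned hash)
{
    Entry *e;

    if (freelist != NULL)
    {
        e = freelist;
        freelist = e->next_free;
    }
    else
        e = malloc(sizeof(Entry));

    e->next_free = NULL;
    e->hash_value = hash;
    e->refcount = 0;
    return e;
}
```

The point of the sketch is only that expiration then pays a list push instead of a full deallocation, and the next cache miss reclaims the memory for free.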
At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
> > Hi,
> >
> > On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > > # Creation, searching and expiration
> > > master     : 6393.23 (100.0)
> > > patched-off: 6527.94 (102.1)
> > > patched-on : 15880.01 (248.4)
> >
> > What's the deal with this massive increase here?
>
> CatCacheRemoveCTup(). If I replace the call to that function in the
> cleanup function with dlist_delete(), the results change as follows:
>
> master      : 6372.04 (100.0) (2)
> patched-off : 6464.97 (101.5) (2)
> patched-on  : 5354.42 ( 84.0) (2)
>
> We could speed up expiration if we reuse the "deleted" entry at the
> next entry creation.

That result is bogus: the modified code forgot to update cc_ntup.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Ah. It was obvious from the start. Sorry for the sloppy diagnosis.

At Fri, 20 Nov 2020 16:08:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Thu, 19 Nov 2020 15:23:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Wed, 18 Nov 2020 21:42:02 -0800, Andres Freund <andres@anarazel.de> wrote in
> > > Hi,
> > >
> > > On 2020-11-19 14:25:36 +0900, Kyotaro Horiguchi wrote:
> > > > # Creation, searching and expiration
> > > > master     : 6393.23 (100.0)
> > > > patched-off: 6527.94 (102.1)
> > > > patched-on : 15880.01 (248.4)
> > >
> > > What's the deal with this massive increase here?

catalog_cache_prune_min_age was set to 0 at the time, so almost all
catcache entries were dropped at rehashing time. Most of the difference
should be the time spent searching the system catalog.

2020-11-20 16:25:25.988 LOG: database system is ready to accept connections
2020-11-20 16:26:48.504 LOG: Catcache reset
2020-11-20 16:26:48.504 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 257: 0.001500 ms
2020-11-20 16:26:48.504 LOG: rehashed catalog cache id 58 for pg_statistic; 257 tups, 256 buckets, 0.020748 ms
2020-11-20 16:26:48.505 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 513: 0.003221 ms
2020-11-20 16:26:48.505 LOG: rehashed catalog cache id 58 for pg_statistic; 513 tups, 512 buckets, 0.006962 ms
2020-11-20 16:26:48.505 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 1025: 0.006744 ms
2020-11-20 16:26:48.505 LOG: rehashed catalog cache id 58 for pg_statistic; 1025 tups, 1024 buckets, 0.009580 ms
2020-11-20 16:26:48.507 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 2049: 0.015683 ms
2020-11-20 16:26:48.507 LOG: rehashed catalog cache id 58 for pg_statistic; 2049 tups, 2048 buckets, 0.041008 ms
2020-11-20 16:26:48.509 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 4097: 0.042438 ms
2020-11-20 16:26:48.509 LOG: rehashed catalog cache id 58 for pg_statistic; 4097 tups, 4096 buckets, 0.077379 ms
2020-11-20 16:26:48.515 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 8193: 0.123798 ms
2020-11-20 16:26:48.515 LOG: rehashed catalog cache id 58 for pg_statistic; 8193 tups, 8192 buckets, 0.198505 ms
2020-11-20 16:26:48.525 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 16385: 0.180831 ms
2020-11-20 16:26:48.526 LOG: rehashed catalog cache id 58 for pg_statistic; 16385 tups, 16384 buckets, 0.361109 ms
2020-11-20 16:26:48.546 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 32769: 0.717899 ms
2020-11-20 16:26:48.547 LOG: rehashed catalog cache id 58 for pg_statistic; 32769 tups, 32768 buckets, 1.443587 ms
2020-11-20 16:26:48.588 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 65537: 1.204804 ms
2020-11-20 16:26:48.591 LOG: rehashed catalog cache id 58 for pg_statistic; 65537 tups, 65536 buckets, 3.069916 ms
2020-11-20 16:26:48.674 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 131073: 2.707709 ms
2020-11-20 16:26:48.681 LOG: rehashed catalog cache id 58 for pg_statistic; 131073 tups, 131072 buckets, 7.127622 ms
2020-11-20 16:26:48.848 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 262145: 5.895630 ms
2020-11-20 16:26:48.862 LOG: rehashed catalog cache id 58 for pg_statistic; 262145 tups, 262144 buckets, 13.433610 ms
2020-11-20 16:26:49.195 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 524289: 12.302632 ms
2020-11-20 16:26:49.223 LOG: rehashed catalog cache id 58 for pg_statistic; 524289 tups, 524288 buckets, 27.710900 ms
2020-11-20 16:26:49.937 LOG: pruning catalog cache id=58 for pg_statistic: removed 1001000 / 1048577: 66.062629 ms
2020-11-20 16:26:51.195 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 1048577: 65.533468 ms
2020-11-20 16:26:52.413 LOG: pruning catalog cache id=58 for pg_statistic: removed 0 / 1048577: 25.623740 ms
2020-11-20 16:26:52.468 LOG: rehashed catalog cache id 58 for pg_statistic; 1048577 tups, 1048576 buckets, 54.314825 ms
2020-11-20 16:26:53.898 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.530582 ms
2020-11-20 16:26:56.404 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 111.634597 ms
2020-11-20 16:26:57.779 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 134.628430 ms
2020-11-20 16:27:00.389 LOG: pruning catalog cache id=58 for pg_statistic: removed 1002001 / 2097153: 147.221688 ms
2020-11-20 16:27:01.851 LOG: pruning catalog cache id=58 for pg_statistic: removed 2000999 / 2097153: 177.610820 ms

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
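The pattern in the log above (prune when enough entries have gone stale, otherwise double the bucket count) follows the decision the patch adds to CatalogCacheCreateEntry. The following is a simplified model of that decision, not the real catcache code: `SimpleCache` is a stand-in struct, and the number of pruneable entries is passed in directly instead of being found by a hash scan.

```c
#include <stdint.h>

/* Stand-ins for the patch's clock and GUC-derived threshold. */
static uint64_t catcacheclock;      /* advanced from transaction timestamps */
static uint64_t prune_min_age_us;   /* catalog_cache_prune_min_age in us;
                                     * UINT64_MAX disables pruning */

typedef struct SimpleCache
{
    int      ntup;        /* live entries */
    int      nbuckets;    /* hash buckets */
    uint64_t oldest_ts;   /* lastaccess of the oldest surviving entry */
} SimpleCache;

/*
 * After inserting an entry: if the fill factor exceeds 2, first try to
 * prune entries older than prune_min_age_us; only rehash (double the
 * bucket count) when nothing could be reclaimed.  Returns 1 if the
 * cache was rehashed, 0 otherwise.
 */
static int
maybe_resize(SimpleCache *cp, int npruneable)
{
    if (cp->ntup <= cp->nbuckets * 2)
        return 0;                   /* still within the fill target */

    /* Skip the scan if even the oldest entry is too young to evict. */
    if (catcacheclock - cp->oldest_ts >= prune_min_age_us && npruneable > 0)
    {
        cp->ntup -= npruneable;     /* pruning avoided the rehash */
        return 0;
    }

    cp->nbuckets *= 2;              /* fall back to enlarging the hash */
    return 1;
}
```

It also shows why forgetting to decrement ntup (the cc_ntup mistake mentioned earlier) skews the results: the fill-factor check keeps firing on entries that are already gone.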
Hello. The commit 4656e3d668 (debug_invalidate_system_caches_always) conflicted with this patch. Rebased. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From ec069488fd2675369530f3f967f02a7b683f0a7f Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:54:31 +0900 Subject: [PATCH v6 1/3] CatCache expiration feature --- src/backend/access/transam/xact.c | 3 ++ src/backend/utils/cache/catcache.c | 87 +++++++++++++++++++++++++++++- src/backend/utils/misc/guc.c | 12 +++++ src/include/utils/catcache.h | 19 +++++++ 4 files changed, 120 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index a2068e3fd4..86888d2409 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1086,6 +1086,9 @@ static void AtStart_Cache(void) { AcceptInvalidationMessages(); + + if (xactStartTimestamp != 0) + SetCatCacheClock(xactStartTimestamp); } /* diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index fa2b49c676..644d92dd9a 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -38,6 +38,7 @@ #include "utils/rel.h" #include "utils/resowner_private.h" #include "utils/syscache.h" +#include "utils/timestamp.h" /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -60,9 +61,19 @@ #define CACHE_elog(...) #endif +/* + * GUC variable to define the minimum age of entries that will be considered + * to be evicted in seconds. -1 to disable the feature. + */ +int catalog_cache_prune_min_age = -1; +uint64 prune_min_age_us; + /* Cache management header --- pointer is NULL until created */ static CatCacheHeader *CacheHdr = NULL; +/* Clock for the last accessed time of a catcache entry. 
*/ +uint64 catcacheclock = 0; + static inline HeapTuple SearchCatCacheInternal(CatCache *cache, int nkeys, Datum v1, Datum v2, @@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache, Index hashIndex, Datum v1, Datum v2, Datum v3, Datum v4); +static bool CatCacheCleanupOldEntries(CatCache *cp); static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys, Datum v1, Datum v2, Datum v3, Datum v4); @@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *srckeys, Datum *dstkeys); +/* GUC assign function */ +void +assign_catalog_cache_prune_min_age(int newval, void *extra) +{ + if (newval < 0) + prune_min_age_us = UINT64_MAX; + else + prune_min_age_us = ((uint64) newval) * USECS_PER_SEC; +} /* * internal support functions @@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache, */ dlist_move_head(bucket, &ct->cache_elem); + /* Record the last access timestamp */ + ct->lastaccess = catcacheclock; + /* * If it's a positive entry, bump its refcount and return it. If it's * negative, we can report failure to the caller. @@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache, return &ct->tuple; } +/* + * CatCacheCleanupOldEntries - Remove infrequently-used entries + * + * Catcache entries happen to be left unused for a long time for several + * reasons. Remove such entries to prevent catcache from bloating. It is based + * on the similar algorithm with buffer eviction. Entries that are accessed + * several times in a certain period live longer than those that have had less + * access in the same duration. 
+ */ +static bool +CatCacheCleanupOldEntries(CatCache *cp) +{ + int nremoved = 0; + int i; + long oldest_ts = catcacheclock; + uint64 prune_threshold = catcacheclock - prune_min_age_us; + + /* Scan over the whole hash to find entries to remove */ + for (i = 0 ; i < cp->cc_nbuckets ; i++) + { + dlist_mutable_iter iter; + + dlist_foreach_modify(iter, &cp->cc_bucket[i]) + { + CatCTup *ct = dlist_container(CatCTup, cache_elem, iter.cur); + + /* Don't remove referenced entries */ + if (ct->refcount == 0 && + (ct->c_list == NULL || ct->c_list->refcount == 0)) + { + if (ct->lastaccess < prune_threshold) + { + CatCacheRemoveCTup(cp, ct); + nremoved++; + + /* don't let the removed entry update oldest_ts */ + continue; + } + } + + /* update the oldest timestamp if the entry remains alive */ + if (ct->lastaccess < oldest_ts) + oldest_ts = ct->lastaccess; + } + } + + cp->cc_oldest_ts = oldest_ts; + + if (nremoved > 0) + elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d", + cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved); + + return nremoved > 0; +} + /* * ReleaseCatCache * @@ -1888,6 +1967,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; + ct->lastaccess = catcacheclock; dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem); @@ -1899,7 +1979,12 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * arbitrarily, we enlarge when fill factor > 2. 
*/ if (cache->cc_ntup > cache->cc_nbuckets * 2) - RehashCatCache(cache); + { + /* try removing old entries before expanding hash */ + if (catcacheclock - cache->cc_oldest_ts < prune_min_age_us || + !CatCacheCleanupOldEntries(cache)) + RehashCatCache(cache); + } return ct; } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 17579eeaca..255e9fa73d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -88,6 +88,7 @@ #include "utils/acl.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/catcache.h" #include "utils/float.h" #include "utils/guc_tables.h" #include "utils/memutils.h" @@ -3445,6 +3446,17 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + { + {"catalog_cache_prune_min_age", PGC_USERSET, RESOURCES_MEM, + gettext_noop("System catalog cache entries that are living unused more than this seconds are considered forremoval."), + gettext_noop("The value of -1 turns off pruning."), + GUC_UNIT_S + }, + &catalog_cache_prune_min_age, + -1, -1, INT_MAX, + NULL, assign_catalog_cache_prune_min_age, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index ddc2762eb3..291e857e38 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -22,6 +22,7 @@ #include "access/htup.h" #include "access/skey.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "utils/relcache.h" @@ -61,6 +62,7 @@ typedef struct catcache slist_node cc_next; /* list link */ ScanKeyData cc_skey[CATCACHE_MAXKEYS]; /* precomputed key info for heap * scans */ + uint64 cc_oldest_ts; /* timestamp (us) of the oldest tuple */ /* * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS @@ -119,6 +121,7 @@ typedef struct catctup bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? 
*/ HeapTupleData tuple; /* tuple management header */ + uint64 lastaccess; /* timestamp in us of the last usage */ /* * The tuple may also be a member of at most one CatCList. (If a single @@ -189,6 +192,22 @@ typedef struct catcacheheader /* this extern duplicates utils/memutils.h... */ extern PGDLLIMPORT MemoryContext CacheMemoryContext; + +/* for guc.c, not PGDLLPMPORT'ed */ +extern int catalog_cache_prune_min_age; + +/* source clock for access timestamp of catcache entries */ +extern uint64 catcacheclock; + +/* SetCatCacheClock - set catcache timestamp source clock */ +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = (uint64) ts; +} + +extern void assign_catalog_cache_prune_min_age(int newval, void *extra); + extern void CreateCacheMemoryContext(void); extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid, -- 2.27.0 From 95b39756890b7f53b99e20180ad1a62b450ef237 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:57:05 +0900 Subject: [PATCH v6 2/3] Remove "dead" flag from catcache tuple --- src/backend/utils/cache/catcache.c | 43 +++++++++++++----------------- src/include/utils/catcache.h | 10 ------- 2 files changed, 18 insertions(+), 35 deletions(-) diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 644d92dd9a..611b65168d 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -480,6 +480,13 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) Assert(ct->refcount == 0); Assert(ct->my_cache == cache); + /* delink from linked list if not yet */ + if (ct->cache_elem.prev) + { + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; + } + if (ct->c_list) { /* @@ -487,14 +494,10 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct) * which will recurse back to me, and the recursive call will do the * work. Set the "dead" flag to make sure it does recurse. 
*/ - ct->dead = true; CatCacheRemoveCList(cache, ct->c_list); return; /* nothing left to do */ } - /* delink from linked list */ - dlist_delete(&ct->cache_elem); - /* * Free keys when we're dealing with a negative entry, normal entries just * point into tuple, allocated together with the CatCTup. @@ -534,7 +537,7 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl) /* if the member is dead and now has no references, remove it */ if ( #ifndef CATCACHE_FORCE_RELEASE - ct->dead && + ct->cache_elem.prev == NULL && #endif ct->refcount == 0) CatCacheRemoveCTup(cache, ct); @@ -609,7 +612,9 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue) if (ct->refcount > 0 || (ct->c_list && ct->c_list->refcount > 0)) { - ct->dead = true; + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; + /* list, if any, was marked dead above */ Assert(ct->c_list == NULL || ct->c_list->dead); } @@ -688,7 +693,8 @@ ResetCatalogCache(CatCache *cache) if (ct->refcount > 0 || (ct->c_list && ct->c_list->refcount > 0)) { - ct->dead = true; + dlist_delete(&ct->cache_elem); + ct->cache_elem.prev = NULL; /* list, if any, was marked dead above */ Assert(ct->c_list == NULL || ct->c_list->dead); } @@ -1268,9 +1274,6 @@ SearchCatCacheInternal(CatCache *cache, { ct = dlist_container(CatCTup, cache_elem, iter.cur); - if (ct->dead) - continue; /* ignore dead entries */ - if (ct->hash_value != hashValue) continue; /* quickly skip entry if wrong hash val */ @@ -1522,7 +1525,6 @@ ReleaseCatCache(HeapTuple tuple) offsetof(CatCTup, tuple)); /* Safety checks to ensure we were handed a cache entry */ - Assert(ct->ct_magic == CT_MAGIC); Assert(ct->refcount > 0); ct->refcount--; @@ -1530,7 +1532,7 @@ ReleaseCatCache(HeapTuple tuple) if ( #ifndef CATCACHE_FORCE_RELEASE - ct->dead && + ct->cache_elem.prev == NULL && #endif ct->refcount == 0 && (ct->c_list == NULL || ct->c_list->refcount == 0)) @@ -1737,8 +1739,8 @@ SearchCatCacheList(CatCache *cache, { ct = dlist_container(CatCTup, cache_elem, iter.cur); - 
if (ct->dead || ct->negative) - continue; /* ignore dead and negative entries */ + if (ct->negative) + continue; /* ignore negative entries */ if (ct->hash_value != hashValue) continue; /* quickly skip entry if wrong hash val */ @@ -1799,14 +1801,13 @@ SearchCatCacheList(CatCache *cache, { foreach(ctlist_item, ctlist) { + Assert (ct->cache_elem.prev != NULL); + ct = (CatCTup *) lfirst(ctlist_item); Assert(ct->c_list == NULL); Assert(ct->refcount > 0); ct->refcount--; if ( -#ifndef CATCACHE_FORCE_RELEASE - ct->dead && -#endif ct->refcount == 0 && (ct->c_list == NULL || ct->c_list->refcount == 0)) CatCacheRemoveCTup(cache, ct); @@ -1834,9 +1835,6 @@ SearchCatCacheList(CatCache *cache, /* release the temporary refcount on the member */ Assert(ct->refcount > 0); ct->refcount--; - /* mark list dead if any members already dead */ - if (ct->dead) - cl->dead = true; } Assert(i == nmembers); @@ -1960,11 +1958,9 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * Finish initializing the CatCTup header, and add it to the cache's * linked list and counts. 
*/ - ct->ct_magic = CT_MAGIC; ct->my_cache = cache; ct->c_list = NULL; ct->refcount = 0; /* for the moment */ - ct->dead = false; ct->negative = negative; ct->hash_value = hashValue; ct->lastaccess = catcacheclock; @@ -2158,9 +2154,6 @@ PrintCatCacheLeakWarning(HeapTuple tuple) CatCTup *ct = (CatCTup *) (((char *) tuple) - offsetof(CatCTup, tuple)); - /* Safety check to ensure we were handed a cache entry */ - Assert(ct->ct_magic == CT_MAGIC); - elog(WARNING, "cache reference leak: cache %s (%d), tuple %u/%u has count %d", ct->my_cache->cc_relname, ct->my_cache->id, ItemPointerGetBlockNumber(&(tuple->t_self)), diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h index 291e857e38..53b0bf31eb 100644 --- a/src/include/utils/catcache.h +++ b/src/include/utils/catcache.h @@ -87,9 +87,6 @@ typedef struct catcache typedef struct catctup { - int ct_magic; /* for identifying CatCTup entries */ -#define CT_MAGIC 0x57261502 - uint32 hash_value; /* hash value for this tuple's keys */ /* @@ -106,19 +103,12 @@ typedef struct catctup dlist_node cache_elem; /* list member of per-bucket list */ /* - * A tuple marked "dead" must not be returned by subsequent searches. - * However, it won't be physically deleted from the cache until its - * refcount goes to zero. (If it's a member of a CatCList, the list's - * refcount must go to zero, too; also, remember to mark the list dead at - * the same time the tuple is marked.) - * * A negative cache entry is an assertion that there is no tuple matching * a particular key. This is just as useful as a normal entry so far as * avoiding catalog searches is concerned. Management of positive and * negative entries is identical. */ int refcount; /* number of active references */ - bool dead; /* dead but not yet removed? */ bool negative; /* negative cache entry? 
*/ HeapTupleData tuple; /* tuple management header */ uint64 lastaccess; /* timestamp in us of the last usage */ -- 2.27.0 From e706934b35f6d6df20c09532d3c53a520cd704cc Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 18 Nov 2020 16:56:41 +0900 Subject: [PATCH v6 3/3] catcachebench --- contrib/catcachebench/Makefile | 17 + contrib/catcachebench/catcachebench--0.0.sql | 14 + contrib/catcachebench/catcachebench.c | 330 +++++++++++++++++++ contrib/catcachebench/catcachebench.control | 6 + src/backend/utils/cache/catcache.c | 33 ++ src/backend/utils/cache/syscache.c | 2 +- 6 files changed, 401 insertions(+), 1 deletion(-) create mode 100644 contrib/catcachebench/Makefile create mode 100644 contrib/catcachebench/catcachebench--0.0.sql create mode 100644 contrib/catcachebench/catcachebench.c create mode 100644 contrib/catcachebench/catcachebench.control diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile new file mode 100644 index 0000000000..0478818b25 --- /dev/null +++ b/contrib/catcachebench/Makefile @@ -0,0 +1,17 @@ +MODULE_big = catcachebench +OBJS = catcachebench.o + +EXTENSION = catcachebench +DATA = catcachebench--0.0.sql +PGFILEDESC = "catcachebench - benchmark for catcache pruning feature" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/catcachebench +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql new file mode 100644 index 0000000000..ea9cd62abb --- /dev/null +++ b/contrib/catcachebench/catcachebench--0.0.sql @@ -0,0 +1,14 @@ +/* contrib/catcachebench/catcachebench--0.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION catcachebench" to load this file. 
\quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..b5a4d794ed --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,330 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + +Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + 
state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. */ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entires of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of a table 6000 times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of all tables twice with having expiration + * happen. + */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 1000 tables scan. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 611b65168d..aabea861ce 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -767,6 +767,39 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = palloc0(128 * sizeof(dlist_head)); + elog(LOG, "Catcache reset"); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index e4dc4ee34e..b60416ec63 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -994,7 +994,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false; -- 2.27.0
Hi,

On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
> Performance measurement on the attached showed better result about
> searching but maybe worse for cache entry creation. Each time number
> is the mean of 10 runs.
>
> # Catcache (negative) entry creation
>            : time(ms) (% to master)
> master     : 3965.61 (100.0)
> patched-off: 4040.93 (101.9)
> patched-on : 4032.22 (101.7)
>
> # Searching negative cache entries
> master     : 8173.46 (100.0)
> patched-off: 7983.43 ( 97.7)
> patched-on : 8049.88 ( 98.5)
>
> # Creation, searching and expiration
> master     : 6393.23 (100.0)
> patched-off: 6527.94 (102.1)
> patched-on : 15880.01 (248.4)
>
> That is, catcache searching gets faster by 2-3% but creation gets
> slower by about 2%. If I moved the condition of 2 further up to
> CatalogCacheCreateEntry(), that degradation reduced to 0.6%.
>
> # Catcache (negative) entry creation
> master      : 3967.45 (100.0)
> patched-off : 3990.43 (100.6)
> patched-on  : 4108.96 (103.6)
>
> # Searching negative cache entries
> master      : 8106.53 (100.0)
> patched-off : 8036.61 ( 99.1)
> patched-on  : 8058.18 ( 99.4)
>
> # Creation, searching and expiration
> master      : 6395.00 (100.0)
> patched-off : 6416.57 (100.3)
> patched-on  : 15830.91 (247.6)

Can you share the exact script or steps to reproduce these numbers? I
presume these are from the catcachebench extension, but I can't figure
out which scenario above corresponds to which catcachebench test. Also,
catcachebench seems to depend on a bunch of tables being created in a
schema called "test"; what tables did you use for the above numbers?

- Heikki
At Tue, 26 Jan 2021 11:43:21 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> Hi,
>
> On 19/11/2020 07:25, Kyotaro Horiguchi wrote:
> > Performance measurement on the attached showed better result about
> > searching but maybe worse for cache entry creation. Each time number
> > is the mean of 10 runs.
> >
> > # Catcache (negative) entry creation
> >            : time(ms) (% to master)
> > master     : 3965.61 (100.0)
> > patched-off: 4040.93 (101.9)
> > patched-on : 4032.22 (101.7)
> >
> > # Searching negative cache entries
> > master     : 8173.46 (100.0)
> > patched-off: 7983.43 ( 97.7)
> > patched-on : 8049.88 ( 98.5)
> >
> > # Creation, searching and expiration
> > master     : 6393.23 (100.0)
> > patched-off: 6527.94 (102.1)
> > patched-on : 15880.01 (248.4)
> >
> > That is, catcache searching gets faster by 2-3% but creation gets
> > slower by about 2%. If I moved the condition of 2 further up to
> > CatalogCacheCreateEntry(), that degradation reduced to 0.6%.
> >
> > # Catcache (negative) entry creation
> > master      : 3967.45 (100.0)
> > patched-off : 3990.43 (100.6)
> > patched-on  : 4108.96 (103.6)
> >
> > # Searching negative cache entries
> > master      : 8106.53 (100.0)
> > patched-off : 8036.61 ( 99.1)
> > patched-on  : 8058.18 ( 99.4)
> >
> > # Creation, searching and expiration
> > master      : 6395.00 (100.0)
> > patched-off : 6416.57 (100.3)
> > patched-on  : 15830.91 (247.6)
>
> Can you share the exact script or steps to reproduce these numbers? I
> presume these are from the catcachebench extension, but I can't figure
> out which scenario above corresponds to which catcachebench
> test. Also, catcachebench seems to depend on a bunch of tables being
> created in schema called "test"; what tables did you use for the above
> numbers?

Use gen_tbl.pl to generate the tables, run2.sh to run the benchmark, and
sumlog.pl to summarize the result of run2.sh.
$ ./gen_tbl.pl | psql postgres
$ ./run2.sh | tee rawresult.txt | ./sumlog.pl

(I found a bug in a benchmark-aid function (CatalogCacheFlushCatalog2),
so I will repost an updated version soon.)

A brief explanation follows, since the scripts are somewhat crude.

run2.sh:
  LOOPS   : number of executions of catcachebench() in a single run
  USES    : take the average of this number of the fastest executions in
            a single run
  BINROOT : common parent directory of the target binaries
  DATADIR : data directory (shared by all binaries)
  PREC    : floating-point format for times and percentages in the results
  TESTS   : comma-separated test numbers given to catcachebench()

The "run" function is invoked as

  run "binary-label" <binary-path> <A> <B> <C>

where A, B and C are values for catalog_cache_prune_min_age; "" means no
setting (used for the master binary). Currently only C takes effect, but
all three must be non-empty strings for it to be applied.

The result output looks like this:

 test | version     | n   | r        | stddev
------+-------------+-----+----------+---------
    1 | patched-off | 1/3 | 14211.96 |  261.19

  test    : number given to catcachebench()
  version : binary label given to the run function
  n       : USES / LOOPS
  r       : average time of catcachebench() in milliseconds
  stddev  : standard deviation of those times

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

#!/usr/bin/perl

$collist = "";
foreach $i (0..1000) {
    $collist .= sprintf(", c%05d int", $i);
}
$collist = substr($collist, 2);

printf "drop schema if exists test cascade;\n";
printf "create schema test;\n";
foreach $i (0..2999) {
    printf "create table test.t%04d ($collist);\n", $i;
}

#!/bin/bash
LOOPS=3
USES=1
TESTS=1,2,3
BINROOT=/home/horiguti/bin
DATADIR=/home/horiguti/data/data_work
PREC="numeric(10,2)"

/usr/bin/killall postgres
/usr/bin/sleep 3

run() {
    local BINARY=$1
    local PGCTL=$2/bin/pg_ctl
    local PGSQL=$2/bin/postgres
    local PSQL=$2/bin/psql

    if [ "$3" != "" ]; then
        local SETTING1="set catalog_cache_prune_min_age to \"$3\";"
        local SETTING2="set catalog_cache_prune_min_age to \"$4\";"
        local SETTING3="set catalog_cache_prune_min_age to \"$5\";"
    fi

#    ($PGSQL -D $DATADIR 2>&1 > /dev/null)&
    ($PGSQL -D $DATADIR 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /')&
    /usr/bin/sleep 3

    ${PSQL} postgres <<EOF
create extension if not exists catcachebench;
select catcachebench(0);
$SETTING3
select distinct * from
  unnest(ARRAY[${TESTS}]) as test,
  LATERAL (select '${BINARY}' as version,
                  '${USES}/' || (count(r) OVER())::text as n,
                  (avg(r) OVER ())::${PREC},
                  (stddev(r) OVER ())::${PREC}
           from (select catcachebench(test) as r
                 from generate_series(1, ${LOOPS})) r
           order by r limit ${USES}) r
EOF

    $PGCTL --pgdata=$DATADIR stop 2>&1 > /dev/null | /usr/bin/sed -e 's/^/# /'
#    oreport > $BINARY_perf.txt
}

for i in $(seq 0 2); do
    run "patched-off" $BINROOT/pgsql_catexp "-1" "-1" "-1"
    run "patched-on"  $BINROOT/pgsql_catexp "0" "0" "0"
    run "master"      $BINROOT/pgsql_master_o2 "" "" ""
done

#!/usr/bin/perl

while (<STDIN>) {
#    if (/^\s+([0-9])\s*\|\s*(\w+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*(\w+)\s*$/) {
    if (/^\s+([0-9])\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|\s*([\d.]+)\s*\|\s*([\d.]+)\s*$/) {
        $test = $1; $bin = $2; $time = $4;
        if (defined $sum{$test}{$bin}) {
            $sum{$test}{$bin} += $time;
            $num{$test}{$bin}++;
        } else {
            $sum{$test}{$bin} = 0;
            $num{$test}{$bin} = 0;
        }
    }
}

foreach $t (sort {$a cmp $b} keys %sum) {
    $master{$t} = $sum{$t}{master} / $num{$t}{master};
}

foreach $t (sort {$a cmp $b} keys %sum) {
    foreach $b (sort {$a cmp $b} keys %{$sum{$t}}) {
        $mean = $sum{$t}{$b} / $num{$t}{$b};
        $ratio = 100.0 * $mean / $master{$t};
        printf("%-13s : %8.2f (%5.1f) (%d)\n",
               "$t:$b", $mean, $ratio, $num{$t}{$b});
    }
}

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file.
\quit + +CREATE FUNCTION catcachebench(IN type int) +RETURNS double precision +AS 'MODULE_PATHNAME', 'catcachebench' +LANGUAGE C STRICT VOLATILE; + +CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint) +RETURNS SETOF record +AS 'MODULE_PATHNAME', 'catcachereadstats' +LANGUAGE C STRICT VOLATILE; diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c new file mode 100644 index 0000000000..f93d60e721 --- /dev/null +++ b/contrib/catcachebench/catcachebench.c @@ -0,0 +1,338 @@ +/* + * catcachebench: test code for cache pruning feature + */ +/* #define CATCACHE_STATS */ +#include "postgres.h" +#include "catalog/pg_type.h" +#include "catalog/pg_statistic.h" +#include "executor/spi.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "utils/catcache.h" +#include "utils/syscache.h" +#include "utils/timestamp.h" + +Oid tableoids[10000]; +int ntables = 0; +int16 attnums[1000]; +int natts = 0; + +PG_MODULE_MAGIC; + +double catcachebench1(void); +double catcachebench2(void); +double catcachebench3(void); +void collectinfo(void); +void catcachewarmup(void); + +PG_FUNCTION_INFO_V1(catcachebench); +PG_FUNCTION_INFO_V1(catcachereadstats); + +extern void CatalogCacheFlushCatalog2(Oid catId); +extern int64 catcache_called; +extern CatCache *SysCache[]; + +typedef struct catcachestatsstate +{ + TupleDesc tupd; + int catId; +} catcachestatsstate; + +Datum +catcachereadstats(PG_FUNCTION_ARGS) +{ + catcachestatsstate *state_data = NULL; + FuncCallContext *fctx; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext mctx; + + fctx = SRF_FIRSTCALL_INIT(); + mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx); + + state_data = palloc(sizeof(catcachestatsstate)); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + 
state_data->tupd = tupdesc; + state_data->catId = 0; + + fctx->user_fctx = state_data; + + MemoryContextSwitchTo(mctx); + } + + fctx = SRF_PERCALL_SETUP(); + state_data = fctx->user_fctx; + + if (state_data->catId < SysCacheSize) + { + Datum values[5]; + bool nulls[5]; + HeapTuple resulttup; + Datum result; + int catId = state_data->catId++; + + memset(nulls, 0, sizeof(nulls)); + memset(values, 0, sizeof(values)); + values[0] = Int16GetDatum(catId); + values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid); +#ifdef CATCACHE_STATS + values[2] = Int64GetDatum(SysCache[catId]->cc_searches); + values[3] = Int64GetDatum(SysCache[catId]->cc_hits); + values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits); +#endif + resulttup = heap_form_tuple(state_data->tupd, values, nulls); + result = HeapTupleGetDatum(resulttup); + + SRF_RETURN_NEXT(fctx, result); + } + + SRF_RETURN_DONE(fctx); +} + +Datum +catcachebench(PG_FUNCTION_ARGS) +{ + int testtype = PG_GETARG_INT32(0); + double ms; + + collectinfo(); + + /* flush the catalog -- safe? don't mind. */ + CatalogCacheFlushCatalog2(StatisticRelationId); + + switch (testtype) + { + case 0: + catcachewarmup(); /* prewarm of syscatalog */ + PG_RETURN_NULL(); + case 1: + ms = catcachebench1(); break; + case 2: + ms = catcachebench2(); break; + case 3: + ms = catcachebench3(); break; + default: + elog(ERROR, "Invalid test type: %d", testtype); + } + + PG_RETURN_DATUM(Float8GetDatum(ms)); +} + +/* + * fetch all attribute entires of all tables. + */ +double +catcachebench1(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* + * fetch all attribute entires of a table 6000 times. + */ +double +catcachebench2(void) +{ + int t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (t = 0 ; t < 240000 ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[0]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +/* SetCatCacheClock - set catcache timestamp source clock */ +uint64 catcacheclock; +static inline void +SetCatCacheClock(TimestampTz ts) +{ + catcacheclock = (uint64) ts; +} + +/* + * fetch all attribute entires of all tables twice with having expiration + * happen. + */ +double +catcachebench3(void) +{ + const int clock_step = 1000; + int i, t, a; + instr_time start, + duration; + + PG_SETMASK(&BlockSig); + INSTR_TIME_SET_CURRENT(start); + for (i = 0 ; i < 4 ; i++) + { + int ct = clock_step; + + for (t = 0 ; t < ntables ; t++) + { + /* + * catcacheclock is updated by transaction timestamp, so needs to + * be updated by other means for this test to work. Here I choosed + * to update the clock every 1000 tables scan. + */ + if (--ct < 0) + { + SetCatCacheClock(GetCurrentTimestamp()); + ct = clock_step; + } + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. 
*/ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } + } + INSTR_TIME_SET_CURRENT(duration); + INSTR_TIME_SUBTRACT(duration, start); + PG_SETMASK(&UnBlockSig); + + return INSTR_TIME_GET_MILLISEC(duration); +}; + +void +catcachewarmup(void) +{ + int t, a; + + /* load up catalog tables */ + for (t = 0 ; t < ntables ; t++) + { + for (a = 0 ; a < natts ; a++) + { + HeapTuple tup; + + tup = SearchSysCache3(STATRELATTINH, + ObjectIdGetDatum(tableoids[t]), + Int16GetDatum(attnums[a]), + BoolGetDatum(false)); + /* should be null, but.. */ + if (HeapTupleIsValid(tup)) + ReleaseSysCache(tup); + } + } +} + +void +collectinfo(void) +{ + int ret; + Datum values[10000]; + bool nulls[10000]; + Oid types0[] = {OIDOID}; + int i; + + ntables = 0; + natts = 0; + + SPI_connect(); + /* collect target tables */ + ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')", + true, 0); + if (ret != SPI_OK_SELECT) + elog(ERROR, "Failed 1"); + if (SPI_processed == 0) + elog(ERROR, "no relation found in schema \"test\""); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in schema \"test\""); + + for (i = 0 ; i < SPI_processed ; i++) + { + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 2"); + + tableoids[ntables++] = DatumGetObjectId(values[0]); + } + SPI_finish(); + elog(DEBUG1, "%d tables found", ntables); + + values[0] = ObjectIdGetDatum(tableoids[0]); + nulls[0] = false; + SPI_connect(); + ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid =$1)", + 1, types0, values, NULL, true, 0); + if (SPI_processed == 0) + elog(ERROR, "no attribute found in table %d", tableoids[0]); + if (SPI_processed > 10000) + elog(ERROR, "too many relation found in table %d", tableoids[0]); + + /* collect target attributes. 
assuming all tables have the same attnums */ + for (i = 0 ; i < SPI_processed ; i++) + { + int16 attnum; + + heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc, + values, nulls); + if (nulls[0]) + elog(ERROR, "Failed 3"); + attnum = DatumGetInt16(values[0]); + + if (attnum > 0) + attnums[natts++] = attnum; + } + SPI_finish(); + elog(DEBUG1, "%d attributes found", natts); +} diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control new file mode 100644 index 0000000000..3fc9d2e420 --- /dev/null +++ b/contrib/catcachebench/catcachebench.control @@ -0,0 +1,6 @@ +# catcachebench + +comment = 'benchmark for catcache pruning' +default_version = '0.0' +module_pathname = '$libdir/catcachebench' +relocatable = true diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index fa2b49c676..11b94504af 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -740,6 +740,41 @@ CatalogCacheFlushCatalog(Oid catId) CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); } + +/* FUNCTION FOR BENCHMARKING */ +void +CatalogCacheFlushCatalog2(Oid catId) +{ + slist_iter iter; + + CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId); + + slist_foreach(iter, &CacheHdr->ch_caches) + { + CatCache *cache = slist_container(CatCache, cc_next, iter.cur); + + /* Does this cache store tuples of the target catalog? 
*/ + if (cache->cc_reloid == catId) + { + /* Yes, so flush all its contents */ + ResetCatalogCache(cache); + + /* Tell inval.c to call syscache callbacks for this cache */ + CallSyscacheCallbacks(cache->id, 0); + + cache->cc_nbuckets = 128; + pfree(cache->cc_bucket); + cache->cc_bucket = + (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, + cache->cc_nbuckets * sizeof(dlist_head)); + elog(LOG, "Catcache reset"); + } + } + + CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call"); +} +/* END: FUNCTION FOR BENCHMARKING */ + /* * InitCatCache * diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c index e4dc4ee34e..b60416ec63 100644 --- a/src/backend/utils/cache/syscache.c +++ b/src/backend/utils/cache/syscache.c @@ -994,7 +994,7 @@ static const struct cachedesc cacheinfo[] = { } }; -static CatCache *SysCache[SysCacheSize]; +CatCache *SysCache[SysCacheSize]; static bool CacheInitialized = false;