Thread: Global shared meta cache
Hi, hackers!

My customer created hundreds of thousands of partition tables and tried to select data from them through hundreds of applications, which resulted in enormous memory consumption: roughly the number of backends multiplied by the per-backend local cache memory (e.g. 100 backends x 1GB = 100GB). Relation caches are loaded into each backend's local memory.

To address this issue I'm trying to move metadata caches such as the catcache and relcache into shared memory.

This topic seems to have been discussed several times. For instance this thread:
https://www.postgresql.org/message-id/CA%2BTgmobjDw_SWsxyJwT9z-YOwWv0ietuQx5fb%3DWEYdDfvCbzGQ%40mail.gmail.com

In my understanding, it discussed moving the catcache and relcache from the current backend-local memory to shared memory, and was most concerned with the performance overhead.

Robert Haas wrote:
> I think it would be interesting for somebody to build a prototype here
> that ignores all the problems but the first and uses some
> straightforward, relatively unoptimized locking strategy for the first
> problem. Then benchmark it. If the results show that the idea has
> legs, then we can try to figure out what a real implementation would
> look like.
> (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.)

I'm inspired by this comment and am now developing a prototype (please see attached), but I haven't yet put the cache structure on shared memory. Instead, I put dummy data on shared memory which is initialized at startup, and then acquire/release a lock just before/after searching or creating a catcache entry (a rough illustrative sketch of this locking pattern follows at the end of this message). I haven't considered the relcache or catcache lists either. It is difficult for me to do everything at once in the right direction, so I'm making a small prototype to check whether I'm walking on the proper path.

I tested pgbench to compare the master branch with my patch.

0) Environment
- RHEL 7.4
- 16 cores
- 128 GB memory

1) Initialized with pgbench -i -s10

2) Benchmarked 3 times for each condition and took the average TPS.

                                 |master branch | prototype | proto/master (%)
------------------------------------------------------------------------------------
pgbench -c48 -T60 -Msimple -S    | 131297       |130541     |101%
pgbench -c48 -T60 -Msimple       |   4956       |  4965     |95%
pgbench -c48 -T60 -Mprepared -S  | 129688       |132538     |97%
pgbench -c48 -T60 -Mprepared     |   5113       |  4615     |84%

This result seems to show that, except for the prepared protocol without -S (not SELECT-only), it didn't make much difference.

What do you think about it? Before I dig deeper, I want to hear your thoughts.

Best regards,
Takeshi Ideriha
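For illustration only, here is a minimal sketch of the "take a lock around catcache search/creation" idea described above. This is not the attached patch; the lock variable and the wrapper name are invented, and a real version would presumably distinguish shared and exclusive modes:

static LWLock *GlobalCatCacheLock;   /* hypothetical lock living in shared memory */

HeapTuple
SearchCatCacheShared(CatCache *cache, Datum v1, Datum v2, Datum v3, Datum v4)
{
    HeapTuple   tuple;

    /* serialize access to the (eventually shared) cache structures */
    LWLockAcquire(GlobalCatCacheLock, LW_EXCLUSIVE);
    tuple = SearchCatCache(cache, v1, v2, v3, v4);
    LWLockRelease(GlobalCatCacheLock);

    return tuple;
}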
Attachment
Takeshi-san,
>My customer created hundreds of thousands of partition tables and tried to select data from hundreds of applications,
>which resulted in enormous consumption of memory because it consumed # of backend multiplied by #
> of local memory (ex. 100 backends X 1GB = 100GB).
>Relation caches are loaded on each backend local memory.
My team and I have been working on making caches shared for the past two years, but we have chosen not to share the system and relation caches.
Reason being that these caches play a big role in transactional DDL processing.
When you do DDL, your backend can see all the changes because you update your own cache, but not anyone else's until you commit.
You will find that dealing with that will be the true complexity.
Have you tried to simply cap the size of these caches?
That's a rather straightforward piece of work and will get you quite far.
We run with a 20MB syscache and a 10MB relcache with 100k+ objects and hundreds of backends.
A dumb LRU is plenty good for the purpose.
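To make that concrete, here is a minimal sketch of such a "dumb LRU" cap, with hypothetical bookkeeping fields and a placeholder limit; none of this comes from a posted patch:

#include "lib/ilist.h"

static dlist_head CatCacheLruList = DLIST_STATIC_INIT(CatCacheLruList);
static Size catcache_used_bytes = 0;
static Size catcache_limit_bytes = 20 * 1024 * 1024;    /* e.g. a 20MB cap */

typedef struct LruEntry
{
    dlist_node  lru_node;       /* position in CatCacheLruList, MRU at head */
    Size        size;           /* memory accounted to this cache entry */
    /* ... the cached tuple itself would follow ... */
} LruEntry;

/* On every cache hit, move the entry to the MRU end. */
static void
lru_touch(LruEntry *entry)
{
    dlist_move_head(&CatCacheLruList, &entry->lru_node);
}

/* After inserting a new entry, evict from the LRU end until under the cap. */
static void
lru_insert(LruEntry *entry)
{
    dlist_push_head(&CatCacheLruList, &entry->lru_node);
    catcache_used_bytes += entry->size;

    while (catcache_used_bytes > catcache_limit_bytes &&
           !dlist_is_empty(&CatCacheLruList))
    {
        LruEntry   *victim = dlist_tail_element(LruEntry, lru_node,
                                                &CatCacheLruList);

        if (victim == entry)
            break;              /* never evict what we just inserted */
        dlist_delete(&victim->lru_node);
        catcache_used_bytes -= victim->size;
        /* free the underlying cache entry here */
    }
}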
That being said I would love to see these caches shared. :-)
Cheers
Serge
Salesforce
Hi, On 2018-06-26 06:48:28 +0000, Ideriha, Takeshi wrote: > > I think it would be interested for somebody to build a prototype here > > that ignores all the problems but the first and uses some > > straightforward, relatively unoptimized locking strategy for the first > > problem. Then benchmark it. If the results show that the idea has > > legs, then we can try to figure out what a real implementation would > > look like. > > (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.) > > I'm inspired by this comment and now developing a prototype (please see attached), > but I haven't yet put cache structure on shared memory. > Instead, I put dummy data on shared memory which is initialized at startup, > and then acquire/release lock just before/after searching/creating catcache entry. > I haven't considered relcache and catcachelist either. > It is difficult for me to do everything at one time with right direction. > So I'm trying to make small prototype and see what I'm walking on the proper way. > > I tested pgbench to compare master branch with my patch. > > 0) Environment > - RHEL 7.4 > - 16 cores > - 128 GB memory > > 1) Initialized with pgbench -i -s10 > > 2) benchmarked 3 times for each conditions and got the average result of TPS. > |master branch | prototype | proto/master (%) > ------------------------------------------------------------------------------------ > pgbench -c48 -T60 -Msimple -S | 131297 |130541 |101% > pgbench -c48 -T60 -Msimple | 4956 |4965 |95% > pgbench -c48 -T60 -Mprepared -S |129688 |132538 |97% > pgbench -c48 -T60 -Mprepared |5113 |4615 |84% > > This result seems to show except for prepared protocol with "not only SELECT" it didn't make much difference. This seems like an pretty large regression to me. And that's an extremely simplistic case, with tiny caches, and barely any changes to the cache contents. Greetings, Andres Freund
Ideriha, Takeshi wrote
> 2) benchmarked 3 times for each conditions and got the average result of TPS.
>                                  |master branch | prototype | proto/master (%)
> ------------------------------------------------------------------------------------
> pgbench -c48 -T60 -Msimple -S    | 131297       |130541     |101%
> pgbench -c48 -T60 -Msimple       |   4956       |  4965     |95%
> pgbench -c48 -T60 -Mprepared -S  | 129688       |132538     |97%
> pgbench -c48 -T60 -Mprepared     |   5113       |  4615     |84%
>
> 001_global_meta_cache.patch (6K)
> <http://www.postgresql-archive.org/attachment/6026686/0/001_global_meta_cache.patch>

Hello,

Apologies for the question; I thought I would just double-check the percentages that have been presented. Is the percentage calculation correct? #1 and #3 look inverted to me (lower when they should be higher and vice versa), and #2 and #4 look generally incorrect (the percentages look much further off than the raw numbers suggest).

I.e. for Msimple -S the prototype had slightly worse TPS performance (130541) versus master (131297); I would expect the percentage to be e.g. 99%, not 101%. But I may be misunderstanding something :)

Also, Msimple is 4956 on master versus 4965 on the prototype. Just a 9 TPS change, a very slight improvement in TPS, but the percentage provided is 95%; I would expect it to be just over 100%. Again, maybe I'm not understanding, and I hope it is just my error :)

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
> 1) Initialized with pgbench -i -s10
...
> pgbench -c48 -T60 -Msimple | 4956 |4965 |95%

The scaling factor should be much greater than the number of clients. Otherwise, multiple clients would conflict on the same row of the branches table, which might hide the additional overhead of the global metacache. And I think -j should be 12 or so on your 16-core server, so that the pgbench client can maximize its concurrency.

Regards
Takayuki Tsunakawa
On 26.06.2018 09:48, Ideriha, Takeshi wrote: > Hi, hackers! > > My customer created hundreds of thousands of partition tables and tried to select data from hundreds of applications, > which resulted in enormous consumption of memory because it consumed # of backend multiplied by # of local memory (ex.100 backends X 1GB = 100GB). > Relation caches are loaded on each backend local memory. > > To address this issue I'm trying to move meta caches like catcache or relcache into shared memory. > > This topic seems to have been discussed several times. > For instance this thread: > https://www.postgresql.org/message-id/CA%2BTgmobjDw_SWsxyJwT9z-YOwWv0ietuQx5fb%3DWEYdDfvCbzGQ%40mail.gmail.com > > In my understanding, it discussed moving catcache and relcache to shared memory rather than current local backend memory, > and is most concerned with performance overhead. > > Robert Haas wrote: >> I think it would be interested for somebody to build a prototype here >> that ignores all the problems but the first and uses some >> straightforward, relatively unoptimized locking strategy for the first >> problem. Then benchmark it. If the results show that the idea has >> legs, then we can try to figure out what a real implementation would >> look like. >> (One possible approach: use Thomas Munro's DHT stuff to build the shared cache.) > I'm inspired by this comment and now developing a prototype (please see attached), > but I haven't yet put cache structure on shared memory. > Instead, I put dummy data on shared memory which is initialized at startup, > and then acquire/release lock just before/after searching/creating catcache entry. > > I haven't considered relcache and catcachelist either. > It is difficult for me to do everything at one time with right direction. > So I'm trying to make small prototype and see what I'm walking on the proper way. > > I tested pgbench to compare master branch with my patch. > > 0) Environment > - RHEL 7.4 > - 16 cores > - 128 GB memory > > 1) Initialized with pgbench -i -s10 > > 2) benchmarked 3 times for each conditions and got the average result of TPS. > |master branch | prototype | proto/master (%) > ------------------------------------------------------------------------------------ > pgbench -c48 -T60 -Msimple -S | 131297 |130541 |101% > pgbench -c48 -T60 -Msimple | 4956 |4965 |95% > pgbench -c48 -T60 -Mprepared -S |129688 |132538 |97% > pgbench -c48 -T60 -Mprepared |5113 |4615 |84% > > This result seems to show except for prepared protocol with "not only SELECT" it didn't make much difference. > > > What do you think about it? > Before I dig deeper, I want to hear your thoughts. > > Best regards, > Takeshi Ideriha > Hi, I really think that we need to move to global caches (and especially catalog caches) in Postgres. Modern NUMA servers may have hundreds of cores and to be able to utilize all of them, we may need to start large number (hundreds) of backends. Memory overhead of local cache multiplied by 1000 can be quite significant. But I am not sure that just using RW lock will be enough replace local cache with global. I am quite skeptical concerning performance results you have provided. Once dataset completely fits in memory (which is true in your case), select-only pgbench with prepared statements should be about two times faster, than without prepared statements. And in your case performance with prepared statements is even worser. I wonder if you have repeated each measurement multiple time, to make sure that it is not just a fluctuation. 
Also, which postgresql configuration have you used? If it is the default postgresql.conf with a 128MB shared buffers size, then you are measuring the time of disk access, and the catalog cache is not relevant for performance in this case.

Below are the results I got with pgbench scale 100 (with scale 10 the results are slightly better) at my desktop with just 16GB of RAM and 4 cores:

                                 |master branch | prototype | proto/master (%)
------------------------------------------------------------------------------------
pgbench -c10 -T60 -Msimple -S    | 187189       |182123     |97%
pgbench -c10 -T60 -Msimple       |  15495       | 15112     |97%
pgbench -c10 -T60 -Mprepared -S  |  98273       | 92810     |94%
pgbench -c10 -T60 -Mprepared     |  25796       | 25169     |97%

As you see, there are no surprises here: the negative effect of the shared cache is the largest for the case of non-prepared selects (because the selects themselves are much faster than updates, and during compilation we have to access relations multiple times).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>-----Original Message----- >From: AJG [mailto:ayden@gera.co.nz] >Sent: Wednesday, June 27, 2018 3:21 AM >To: pgsql-hackers@postgresql.org >Subject: Re: Global shared meta cache > >Ideriha, Takeshi wrote >> 2) benchmarked 3 times for each conditions and got the average result >> of TPS. >> |master branch | prototype | >> proto/master (%) >> >> ------------------------------------------------------------------------------------ >> pgbench -c48 -T60 -Msimple -S | 131297 |130541 |101% >> pgbench -c48 -T60 -Msimple | 4956 |4965 |95% >> pgbench -c48 -T60 -Mprepared -S |129688 |132538 |97% >> pgbench -c48 -T60 -Mprepared |5113 |4615 |84% >> >> >> 001_global_meta_cache.patch (6K) >> <http://www.postgresql-archive.org/attachment/6026686/0/001_global_ >> meta_cache.patch> > > >Hello, >Apologies for question. I thought I would just double check percentages that have >been presented. >Is the percentage calculation correct? >as #1 and #3 look inverted to me (say lower when should be higher and vice versa), >and >#2 and #4 look incorrect generally (percentages look much larger than they should be >based on numbers. > >I.e. Msimple -S the protype had slightly worse tps performance (130541) versus >Master (131297). I would expect the percentage to be e.g. 99% not 101% > >But I may be misunderstanding something :) > >Also, Msimple is 4956 master versus 4965 prototype. Just 9 tps change. A very slight >improvement in tps. but the percentage provided is 95%. I would expect it to be just >over 100%? >Again, maybe im not understanding, and hoping it is just my error :) > > > >-- >Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html > Hi, Thank you for comments and sorry for late replay. Thanks to you, I noticed I made a mistake. As you pointed out, I think my calculation is wrong. I also need to change some settings of postgresql.conf and pgbench. So I'm going to measure performance again and submit the result. Regards, Takeshi Ideriha
On Mon, Jul 2, 2018 at 5:59 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> But I am not sure that just using RW lock will be enough replace local cache
> with global.

I'm pretty sure it won't. In fact, no matter what kind of locking you use, it's bound to cost something. There is no such thing as a free lunch. It does not seem realistic to me to suppose that we're going to just get rid of all of our backend-private caches and replace them with a shared cache and somehow there will be no performance regression. Maybe someone will come up with something that is surprisingly awesome, but I doubt it.

I think we need to take a little bit broader view of this problem. For instance, maybe we could have backend-local caches that are kept relatively small, and then a larger shared cache that can hold more entries. There are code comments someplace that say that CLOBBER_CACHE_ALWAYS is about 100x slower than a regular build, and CLOBBER_CACHE_RECURSIVELY is about 10000x slower. Obviously, loading an entry into our backend-private cache must be a LOT slower than consulting one. If the shared cache is only, say, 3x slower than the backend-private cache, then we might be able to get away with having only the really-frequently-accessed stuff in the actual private cache and the somewhat-frequently-accessed stuff in the shared cache. Now that's probably still going to cost something, but maybe we can make that "something" very small in the cases people are actually likely to hit. I would guess that we'd want to try to use something like the st_changecount protocol to make reads very cheap and writes comparatively more expensive, since by and large cache invalidations aren't that frequent.

Another approach would be to consider whether we're caching too much useless junk in the first place. For instance, maybe there's some stuff in the relcache that takes up a lot of memory compared to how much of a performance boost it produces, or maybe there's some stuff that could be represented more compactly.

One experiment I think would be interesting is to study how much catcache traffic we're actually generating and see if we can find any way to reduce it. For instance, somebody could write code to record the file and line number for every catcache lookup and then run a test workload (or several test workloads). That might give us some ideas about stuff we could just decide not to cache, especially if we also knew how large each cache ended up being.

With respect to partitioning specifically, it seems like we might be able to come up with some way of planning that doesn't need a full relcache entry for every partition, particularly if there are no partition-local objects (indexes, triggers, etc.). But that seems like a hard refactoring, and even if we did it, what about execution time? So much code expects to be handed a Relation. Still, I have a suspicion that there might be some way to do better here with enough work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
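For illustration, here is a minimal sketch of an st_changecount-style read protocol of the kind mentioned above. The type and field names are invented, and a real version also needs explicit memory barriers around the payload copy:

#include "port/atomics.h"

typedef struct CachePayload
{
    Oid         relid;
    /* ... whatever the entry actually caches ... */
} CachePayload;

typedef struct SharedCacheEntry
{
    pg_atomic_uint32 changecount;   /* odd while a writer is mid-update */
    CachePayload payload;
} SharedCacheEntry;

/* Writer: bump the counter before and after modifying the payload. */
static void
shared_entry_write(SharedCacheEntry *e, const CachePayload *newval)
{
    pg_atomic_fetch_add_u32(&e->changecount, 1);    /* becomes odd */
    e->payload = *newval;
    pg_atomic_fetch_add_u32(&e->changecount, 1);    /* even again */
}

/* Reader: retry until the copy was not torn by a concurrent writer. */
static void
shared_entry_read(SharedCacheEntry *e, CachePayload *out)
{
    uint32      before;

    do
    {
        before = pg_atomic_read_u32(&e->changecount);
        *out = e->payload;
    } while ((before & 1) != 0 ||
             before != pg_atomic_read_u32(&e->changecount));
}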
On 05.07.2018 17:00, Robert Haas wrote: > On Mon, Jul 2, 2018 at 5:59 AM, Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: >> But I am not sure that just using RW lock will be enough replace local cache >> with global. > I'm pretty sure it won't. In fact, no matter what kind of locking you > use, it's bound to cost something. There is no such thing as a free > lunch. It does not seem realistic to me to suppose that we're going > to just get rid of all of our backend-private caches and replace them > with a shared cache and somehow there will be no performance > regression. Maybe someone will come up with something that is > surprisingly awesome, but I doubt it. > > I think we need to take a little bit broader view of this problem. > For instance, maybe we could have backend-local caches that are kept > relatively small, and then a larger shared cache that can hold more > entries. There are code comments someplace that say that > CLOBBER_CACHE_ALWAYS is about 100x slower than a regular build, and > CLOBBER_CACHE_ALWAYS is about 10000x slower. Obviously, loading an > entry into our backend-private cache must be a LOT slower than > consulting one. If the shared cache is only, say, 3x slower than the > backend-private cache, then we might be able to get away with having > only the really-frequently-accessed stuff in the actual private cache > and the somewhat-frequently-accessed stuff in the shared cache. Now > that's probably still going to cost something, but maybe we can make > that "something" very small in the cases people are actually likely to > hit. I would guess that we'd want to try to use something like the > st_changecount protocol to make reads very cheap and writes > comparatively more expensive, since by and large cache invalidations > aren't that frequent. > > Another approach would be to consider whether we're caching too much > useless junk in the first place. For instance, maybe there's some > stuff in the relcache that takes up a lot of memory compared to how > much of a performance boost it produces, or maybe there's some stuff > that could be represented more compactly. > > One experiment I think would be interesting is to study how much > catcache traffic we're actually generating and see if we can find any > way to reduce it. For instance, somebody could write code to record > the file and line number for every catcache lookup and then run a test > workload (or several test workloads). That might give us some ideas > about stuff we could just decide not to cache, especially if we also > knew how large each cache ended up being. > > With respect to partitioning specifically, it seems like we might be > able to come up with some way of planning that doesn't need a full > relcache entry for every partition, particularly if there are no > partition-local objects (indexes, triggers, etc.). But that seems > like a hard refactoring, and even if we did it, what about execution > time? So much code expects to be handed a Relation. Still, I have a > suspicion that there might be some way to do better here with enough > work. > 94% slowdown at my desktop seems to be not so significant degradation. But I tried this patch at more powerful server with 24 physical cores and here the negative effect of global cache synchronization was much more dramatic: 196k TPS vs. 395k TPS for select-only pgbench with -S -c 100 -j 10. Almost two times! So we really need more sophisticated and smart solution for the problem of global caches. 
It looks like a combination of a small local cache and a big global cache is the best alternative.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi, On 2018-07-05 10:00:13 -0400, Robert Haas wrote: > I think we need to take a little bit broader view of this problem. > For instance, maybe we could have backend-local caches that are kept > relatively small, and then a larger shared cache that can hold more > entries. I think it's pretty much *required* that we have that. Not just for speed, but for correctness. It'll otherwise be very hard to deal with transactional DDL. I'm pretty sure that we'll have to have everything modified by the local transaction in that cache. There's a lot of interesting additional problems with snapshots stil, but that seems like a baseline requirement. > Obviously, loading an entry into our backend-private cache must be a > LOT slower than consulting one. I'm not actually sure it makes *that* much of a difference, if the local cache is small. > I would guess that we'd want to try to use something like the > st_changecount protocol to make reads very cheap and writes > comparatively more expensive, since by and large cache invalidations > aren't that frequent. That strikes me as hard to get right and fatal to get wrong. I suspect normal RW locking ought to do fine. > One experiment I think would be interesting is to study how much > catcache traffic we're actually generating and see if we can find any > way to reduce it. For instance, somebody could write code to record > the file and line number for every catcache lookup and then run a test > workload (or several test workloads). That might give us some ideas > about stuff we could just decide not to cache, especially if we also > knew how large each cache ended up being. We definitely do a lot of redundant lookups for the same entries. Greetings, Andres Freund
>-----Original Message-----
>From: serge@rielau.com [mailto:serge@rielau.com]
>Sent: Wednesday, June 27, 2018 2:04 AM
>To: Ideriha, Takeshi/出利葉 健 <ideriha.takeshi@jp.fujitsu.com>; pgsql-hackers
><pgsql-hackers@postgresql.org>
>Subject: RE: Global shared meta cache
>
>Takeshi-san,
>
>>My customer created hundreds of thousands of partition tables and tried
>>to select data from hundreds of applications, which resulted in
>>enormous consumption of memory because it consumed # of backend multiplied by
>># of local memory (ex. 100 backends X 1GB = 100GB).
>>Relation caches are loaded on each backend local memory.
>My team and I have been working on making caches shared for the past two years, but
>we have chosen not to share the system and relation caches.
>Reason being that these caches play a big role in transactional DDL processing.
>When you do DDL your backend can see all the changes since you update your own
>cache, but not anyone else's until you commit.
>You will find that dealing with that will be the true complexity.

Hi Serge,
Thank you for sharing your experience. I hadn't thought much about the DDL visibility problem. Introducing version control like MVCC to the catcache may solve the problem, but it seems too much to me. It may be good to keep a local catcache for the in-progress transaction rather than sharing everything. (Other hackers also pointed this out.)

>Have you tried to simply cap the size of these caches?
>That's a rather straightforward piece of work and will get you quite far.
>We run with a 20MB syscache and a 10MB relcache with 100k+ objects and hundreds
>of backends. A dumb LRU is plenty good for the purpose.

I haven't tried it yet, but I read some relevant discussion:
https://www.postgresql.org/message-id/flat/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp
I think the cap solution alleviates memory bloat in some cases, but there is still a problematic case when there are very many backends.

>That being said I would love to see these caches shared. :-)

Thank you!

Regards,
Takeshi
On 2018/07/05 23:00, Robert Haas wrote: > With respect to partitioning specifically, it seems like we might be > able to come up with some way of planning that doesn't need a full > relcache entry for every partition, particularly if there are no > partition-local objects (indexes, triggers, etc.). We won't know that there are no partition-local objects until we open them though, right? As you said, there might be a way to refactor things such that just knowing that there are no partition-local objects becomes cheaper than doing a full-fledged RelationBuildDesc. Thanks, Amit
Hi, Konstantin

>Hi,
>I really think that we need to move to global caches (and especially catalog caches) in
>Postgres.
>Modern NUMA servers may have hundreds of cores and to be able to utilize all of them,
>we may need to start large number (hundreds) of backends.
>Memory overhead of local cache multiplied by 1000 can be quite significant.

Yeah, thank you for the comment.

>I am quite skeptical concerning performance results you have provided.
>Once dataset completely fits in memory (which is true in your case), select-only
>pgbench with prepared statements should be about two times faster, than without
>prepared statements. And in your case performance with prepared statements is even
>worse.
>
>I wonder if you have repeated each measurement multiple time, to make sure that it
>is not just a fluctuation.
>Also which postgresql configuration you have used. If it is default postgresql.conf with
>128MB shared buffers size, then you are measuring time of disk access and catalog
>cache is not relevant for performance in this case.
>
>Below are results I got with pgbench scale 100 (with scale 10 results are slightly better)
>at my desktop with just 16GB of RAM and 4 cores:
>
>                                 |master branch | prototype | proto/master (%)
>------------------------------------------------------------------------------------
>pgbench -c10 -T60 -Msimple -S    | 187189       |182123     |97%
>pgbench -c10 -T60 -Msimple       |  15495       | 15112     |97%
>pgbench -c10 -T60 -Mprepared -S  |  98273       | 92810     |94%
>pgbench -c10 -T60 -Mprepared     |  25796       | 25169     |97%
>
>As you see there are no surprises here: negative effect of shared cache is the largest
>for the case of non-prepared selects (because selects themselves are much faster
>than updates and during compilation we have to access relations multiple times).

As you pointed out, my shared_buffers and scaling factor were too small. I did the benchmark again with new settings, and my results seem to reproduce yours. On a machine with 128GB memory and 16 cores, shared_buffers was set to 32GB and the DB was initialized with -s100. TPS results follow (mean of 10 measurements, rounded to integers):

                                      |master branch | proto  | proto/master (%)
------------------------------------------------------------------------------------
pgbench -c48 -T60 -j16 -Msimple -S    |122140        | 114103 | 93
pgbench -c48 -T60 -j16 -Msimple       |  7858        |   7822 | 100
pgbench -c48 -T60 -j16 -Mprepared -S  |221740        | 210778 | 95
pgbench -c48 -T60 -j16 -Mprepared     |  9257        |   8998 | 97

As you mentioned, SELECT-only queries show more overhead. (By the way, I think in a later email you mentioned the result when the number of concurrent clients is larger. I'll also try to check that case.)

====================
Takeshi Ideriha
Fujitsu Limited
Hi,

Thank you for the discussion a while ago; I'm afraid I haven't replied to everything. To move this development forward I attached a PoC patch.

I introduced a GUC called shared_catcache_mem to specify how much memory is supposed to be allocated for the shared memory area. It defaults to zero, which indicates that no catalog cache is shared and caches are allocated in each backend's MemoryContext (same as current Postgres).

At this moment the patch only allocates the catalog cache header and CatCache data on the shared memory area. It doesn't do much yet, just starting and stopping the postgres server with shared_catcache_mem non-zero. The shared version of CatCacheHdr is put on the postgres-initialized shared memory so that backends can attach to it and build SysCache[] to store pointers to each CatCache. Each CatCache, CatCTup and CatCList is also allocated on the shared memory area, whose size is limited by the value of shared_catcache_mem and backed by DSA. This area is first created in the postgres-initialized shared memory and then re-initialized as a DSA area, because the address of the postgres-initialized shared area does not change between processes, which hopefully makes it easier to handle pointers into shared memory (a rough sketch of that arrangement follows below). (Though I'm still struggling to grasp the ideas of DSA and the underlying DSM.)

The following are major items I haven't touched:
- make the hash table of each CatCache shared, taking advantage of dshash
- how to evict shared cache entries (LRU mechanism)
- how to treat cache visibility and invalidation coming from transactions, including DDL
- how to alleviate the slowness compared to current PostgreSQL
- make the relcache shared as well as the catcache

If you have any insights/reactions/suggestions, please feel free to comment.

====================
Takeshi Ideriha
Fujitsu Limited
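As a rough sketch of carving a fixed region out of the main shared memory segment and re-initializing it as an in-place DSA area: the GUC wiring, the tranche id and all names below are placeholders, not necessarily what the attached patch does:

#include "storage/lwlock.h"
#include "storage/shmem.h"
#include "utils/dsa.h"

int         shared_catcache_mem = 0;    /* GUC, in kilobytes (assumed) */
static dsa_area *SharedCatCacheArea = NULL;

void
SharedCatCacheShmemInit(void)
{
    Size        size = (Size) shared_catcache_mem * 1024;
    bool        found;
    void       *base;

    if (shared_catcache_mem <= 0)
        return;                 /* feature disabled: behave as today */

    base = ShmemInitStruct("shared catcache area", size, &found);

    if (!found)
        SharedCatCacheArea = dsa_create_in_place(base, size,
                                                 LWTRANCHE_FIRST_USER_DEFINED,
                                                 NULL);
    else
        SharedCatCacheArea = dsa_attach_in_place(base, NULL);

    /*
     * Because the region lives at the same address in every process, raw
     * pointers into it can be shared without dsa_pointer translation.
     */
}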
Attachment
Hi,

>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Sent: Wednesday, October 3, 2018 3:18 PM
>At this moment this patch only allocates catalog cache header and CatCache data on
>the shared memory area.

On this allocation stuff I'm trying to handle it in another thread [1] in a broader way.

>The followings are major items I haven't touched:
>- how to treat cache visibility and invalidation coming from transactions including DDL

On this point some of you gave me comments before, but at that time I had less knowledge and couldn't reply immediately. Sorry for that. Right now two plans come to mind. Plan A is that all of the work is done in shared memory and no local cache is used. Plan B is that both a shared cache and a local cache are used. Based on the discussion several months ago in this thread, plan B would probably be better, but there are some variations of plan B, so I'd like to hear opinions.

A. Use only shared memory
Because everything should be done inside shared memory, it needs the same machinery as the current shared_buffers. That is, handling transactions including DDL in a proper way needs MVCC, and cleaning up obsolete caches needs vacuum. Taking advantage of MVCC and vacuum would work, but it seems pretty tough to implement. So another option is plan B, which handles version control of the cache and cleans it up in a different way.

B. Use both shared memory and local memory
The basic policy is that the shared memory keeps the latest version of a cache as much as possible and each cache has version information (xmin, xmax). The local cache is a kind of cache of the shared one and its lifetime is temporary.

[Search cache]
When a backend wants to use a relation or catalog cache in a transaction, it tries to find it in the following order:
1. local cache
2. shared cache
3. disk

At first there is no local cache, so it tries to search the shared cache and, if found, loads the entry into local memory. If the wanted cache is not found in shared memory, the backend fetches it from disk. (A conceptual sketch of this lookup order follows below.)

[Lifetime of local cache]
When ALTER TABLE/DROP TABLE is issued in a transaction, the relevant local cache should differ from the original one. On this point I'm thinking about two cases.

B-1: Create a local cache at the first reference and keep it until the transaction ends. The relevant local cache is updated or deleted when DROP/ALTER is issued. It's freed when the transaction is committed or aborted.

B-2: The lifetime of the local cache is one snapshot. If the isolation level is read committed, the local cache is deleted every time a command is issued.

In case of B-1, sinval message machinery is necessary to update the local cache, which is the same as the current machinery. On the other hand, case B-2 doesn't need sinval messages, because after one snapshot's duration expires the local cache is deleted. From another point of view, there is a trade-off between B-1 and B-2: B-1 would outweigh B-2 in terms of performance, but B-2 would use less memory.

[Invalidation of shared cache]
I'm thinking that invalidating the shared cache can be the responsibility of a backend which wants to see the latest version, rather than the one which committed the DROP/ALTER command. In my sketch caches have their own version information, so a transaction can compare its snapshot with the shared cache version, and if the cache is not the wanted one, we can obtain it from disk.

Do you have any thoughts?

[1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A6F1EE452%40G01JPEXMBKW04

Regards,
Takeshi Ideriha
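A pseudocode-style sketch of the plan-B lookup order (local -> shared -> disk). Every helper and type name here is invented purely for illustration; this is not code from any patch:

typedef struct MetaCacheEntry MetaCacheEntry;

static MetaCacheEntry *
meta_cache_lookup(Oid key, Snapshot snapshot)
{
    MetaCacheEntry *entry;

    /* 1. backend-local cache: no locking needed */
    entry = local_cache_lookup(key);
    if (entry != NULL)
        return entry;

    /* 2. shared cache: only entries visible under our snapshot (xmin/xmax) */
    entry = shared_cache_lookup(key, snapshot);
    if (entry != NULL)
        return local_cache_install(entry);      /* remember it locally */

    /* 3. fall back to the system catalogs and publish the result */
    entry = build_entry_from_catalog(key);
    shared_cache_insert(entry);                 /* stamped with the creating xid */
    return local_cache_install(entry);
}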
On Mon, Nov 26, 2018 at 12:12:09PM +0000, Ideriha, Takeshi wrote:
> On this allocation stuff I'm trying to handle it in another thread
> [1] in a broader way.

Based on the latest updates of this thread, this is waiting for review, so moved to the next CF.
--
Michael
Attachment
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com]
>Do you have any thoughts?

Hi, I updated my idea, hoping to get some feedback.

[TL;DR]
The basic idea is the following 4 points:
A. The user can choose which databases put their caches (relation and catalog) on shared memory and how much memory is used
B. Caches of committed data are on shared memory; caches of uncommitted data are on local memory
C. Caches on shared memory carry xid information (xmin, xmax)
D. Evict not-recently-used caches from shared memory

[A]
Regarding point A, I can imagine some databases are connected to by lots of clients while others are not. So I introduced a new parameter in postgresql.conf, "shared_meta_cache", which is disabled by default and needs a server restart to enable, e.g. shared_meta_cache = 'db1:500MB, db2:100MB'. Some catcaches like pg_database are shared among all databases, so such shared catcaches are allocated in a dedicated space within shared memory. This space can be controlled by a "shared_meta_global_catcache" parameter, which is named after the global directory. But I want this parameter to be hidden in postgresql.conf to keep things simple for users; it's too detailed.

[B & C]
Regarding B & C, the motivation is that we don't want other backends to see uncommitted tables. The search order is local memory -> shared memory -> disk. A local process searches the cache in shared memory based on its own snapshot and the xid of the cache. When the cache is not found in shared memory, a cache with xmin is made in shared memory (but not in the local one).

When a cache definition is changed by DDL, a new cache is created in the local one, and thus subsequent commands refer to the local cache if needed. When it's committed, the local cache is cleared and the shared cache is updated. This update is done by adding xmax to the old cache and also making a new one with xmin. The idea behind adding a new one is that a newly created cache (new table or altered table) is likely to be used in subsequent transactions. At this point maybe we can make use of the current invalidation mechanism, even though invalidation messages to other backends are not sent. (A simplified sketch of such an xmin/xmax check follows below.)

[D]
As for D, I'm planning to benchmark with a simple LRU. If the performance is bad, I'll change to another algorithm like clock. We don't care about eviction of the local cache, because its lifetime is one transaction and I don't want it to bloat.

best regards,
Takeshi Ideriha
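A simplified illustration of the xmin/xmax idea for shared cache entries. TransactionIdPrecedes/TransactionIdIsValid are the usual xid helpers; the struct, the function, and the "horizon" argument are invented, and a real check would consult the snapshot the way heap tuple visibility does:

#include "access/transam.h"

typedef struct SharedCacheVersion
{
    TransactionId cache_xmin;   /* xid that created this cache entry */
    TransactionId cache_xmax;   /* xid that superseded it, or invalid */
} SharedCacheVersion;

static bool
shared_cache_entry_usable(const SharedCacheVersion *v, TransactionId xmin_horizon)
{
    /* created by a transaction we cannot see yet? */
    if (!TransactionIdPrecedes(v->cache_xmin, xmin_horizon))
        return false;

    /* already superseded by a committed DDL we can see? */
    if (TransactionIdIsValid(v->cache_xmax) &&
        TransactionIdPrecedes(v->cache_xmax, xmin_horizon))
        return false;

    return true;
}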
>From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >[TL; DR] >The basic idea is following 4 points: >A. User can choose which database to put a cache (relation and catalog) on shared >memory and how much memory is used >B. Caches of committed data are on the >shared memory. Caches of uncommitted data are on the local memory. >C. Caches on the shared memory have xid information (xmin, xmax) >D. Evict not recently used cache from shared memory I updated some thoughts about B and C for CatCache. I would be very happy if you put some comments. >[B & C] >Regarding B & C, the motivation is we don't want other backends to see uncommitted >tables. >Search order is local memory -> shared memory -> disk. >Local process searches cache in shared memory based from its own snapshot and xid >of cache. >When cache is not found in shared memory, cache with xmin is made in shared >memory ( but not in local one). > >When cache definition is changed by DDL, new cache is created in local one, and thus >next commands refer to local cache if needed. >When it's committed, local cache is cleared and shared cache is updated. This update >is done by adding xmax to old cache and also make a new one with xmin. The idea >behind adding a new one is that newly created cache (new table or altered table) is >likely to be used in next transactions. At this point maybe we can make use of current >invalidation mechanism, even though invalidation message to other backends is not >sent. My current thoughts: - Each catcache has (maybe partial) HeapTupleHeader - put every catcache on shared memory and no local catcache - but catcache for aborted tuple is not put on shared memory - Hash table exists per kind of CatCache - These hash tables exists for each database and shared - e.g) there is a hash table for pg_class of a DB Why I'm leaning toward not to use local cache follows: - At commit moment you need to copy local cache to global cache. This would delay the response time. - Even if uncommitted catcache is on shared memory, other transaction cannot see the cache. In my idea they have xid information and visibility is checked by comparing xmin, xmax of catcache and snapshot. OK, then if we put catcache on shared memory, we need to check their visibility. But if we use the exact same visibility check mechanism as heap tuple, it takes much more steps compared to current local catcache search. Current visibility check is based on snapshot check and commit/abort check. So I'm thinking to only put in-progress caches or committed one. This would save time for checking catcache status (commit/abort) while searching cache. But basically I'm going to use current visibility check mechanism except commit/ abort check (in other words check of clog). These are how it works. - When creating a catcache, copy heap tuple with heapTupleHeader - When update/delete command for catalog tuple is finished, update xmax of corresponding cache - If there is a cache whose xmin is aborted xid, delete the cache - If there is a cache whose xmax is aborted xid, initialize xmax information - At commit time, there is no action to the shared cache Pending items are - thoughts about shared relcache - "vacuum" process for shared cache Regards, Ideriha Takeshi
Hi, everyone. >From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >My current thoughts: >- Each catcache has (maybe partial) HeapTupleHeader >- put every catcache on shared memory and no local catcache >- but catcache for aborted tuple is not put on shared memory >- Hash table exists per kind of CatCache >- These hash tables exists for each database and shared > - e.g) there is a hash table for pg_class of a DB I talked about shared CatCache (SysCache) with Thomas at PGCon and he suggested using sinval to control cache visibility instead of xid. Base on this I've changed my design. I'll send some PoC patch in a week but share my idea beforehand. I'm sorry this email is too long to read but I'm happy if you have some comments. Basically I won't make shared catcache as default, make it as option. Both local and shared memory has hash tables of catcache. A shared hash entry is catctup itself and a local hash entry is a pointer to the shared catctup. Actually, local hash entry does not hold a direct pointer but points to a handle of shared catctup. The handle points to shared catctup and is located in shared memory. This is intended to avoid dangling pointer of local hash entry due to eviction of shared catctup by LRU. ( The detail about LRU will be written in another email because I'll implement it later.) * Search and Insert Current postgres searches (local) hash table and if it's missed, search the actual catalog (shared buffer and disk) and build the cache; build the negative cache if not found. In new architecture, if cache is not found in local hash table, postgres tries to search shared one before consulting shared buffer. Here is a detail. To begin with, postgres looks up the pointer in local hash table. If it's found, it references the pointer and gets catctup. If not, it searches the shared hash table and gets catctup and insert its pointer into local hash table if the catctup is found. If it doesn't exist in shared hash table either, postgres searches actual catalog and build the cache and in most cases insert it into shared hash table and its pointer to local one. The exception case is that the cache is made from uncommitted catalog tuple, which must not be seen from other process. So an uncommitted cache is built in local memory and pushed directly into local table but not shared one. Lastly, if there is no tuple we're looking for, put negative tuple into shared hash table. * Invalidation and visibility control Now let's talk about invalidation. Current cache invalidation is based on local and shared invalidation queue (sinval). When transaction is committed, sinval msg is queued into shared one. Other processes read and process sinval msgs at their own timing. In shared catcache, I follow the current sinval in most parts. But I'll change the action when sinval msg is queued up and read by a process. When messages are added to shared queue, identify corresponding shared caches (matched by hash value) and turn their "obsolete flag" on. When sinval msg is read by a process, each process deletes the local hash entries (pointer to handler). Each process can see a shared catctup as long as its pointer (local entry) is valid. Because sinval msgs are not processed yet, it's ok to keep seeing the pointer to possibly old cache. After local entry is invalidated, its local process tries to search shared hash table to always find a catctup whose obsolete flag is off. 
The process can see the right shared cache after invalidation messages are read because it checks the obsolete flag and also uncommitted cache never exists in shared memory at all. There is a subtle thing here. Always finding a shared catctup without obsolete mark assumes that the process already read the sinval msgs. So before trying to search shared table, I make the process read sinval msg. After it's read, local cache status becomes consistent with the action to get a new cache. This reading timing is almost same as current postgres behavior because it's happened after local cache miss both in current design and mine. After cache miss in current design, a process opens the relation and gets a heavyweight lock. At this time, in fact, it reads the sinval msgs. (These things are well summarized in talking by Robert Haas at PGCon[1]). Lastly, we need to invalidate a shared catctup itself at some point. But we cannot delete is as long as someone sees it. So I'll introduce refcounter. It's increased or decreased at the same timing when current postgres manipulates the local refcounter of catctup and catclist to avoid catctup is deleted while catclist is used or vice versa (that is SearchCatCache/RelaseCatCache). So shared catctup is deleted when its shared refcount becomes zero and obsolete flag is on. Once it's vanished from shared cache, the obsolete cache never comes back again because a process which tries to get cache but fails in shared hash table already read the sinval messages (in any case it reads them when opening a table and taking a lock). I'll make a PoC aside from performance issue at first and use SharedMemoryContext (ShmContext) [2], which I'm making to allocate/free shared items via palloc/pfree. [1] https://www.pgcon.org/2019/schedule/attachments/548_Challenges%20of%20Concurrent%20DDL.pdf [2] https://commitfest.postgresql.org/23/2166/ --- Regards, Takeshi Ideriha
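For illustration, a sketch of the handle indirection plus refcount/obsolete flag described above. None of these names come from the posted patches, and a real implementation needs to close the race between checking the obsolete flag and taking the pin:

#include "port/atomics.h"

typedef struct SharedCatCTup
{
    pg_atomic_uint32 refcount;  /* pins held by backends using this entry */
    bool        obsolete;       /* set when a matching sinval message arrives */
    /* ... cached catalog tuple data follows ... */
} SharedCatCTup;

typedef struct SharedCatCTupHandle
{
    SharedCatCTup *ct;          /* NULL once the entry has been evicted */
} SharedCatCTupHandle;

/* Local hash entries point at the handle; a hit is just two dereferences. */
static SharedCatCTup *
pin_shared_catctup(SharedCatCTupHandle *handle)
{
    SharedCatCTup *ct = handle->ct;

    if (ct == NULL || ct->obsolete)
        return NULL;            /* caller retries via the shared hash table */

    pg_atomic_fetch_add_u32(&ct->refcount, 1);
    return ct;
}

static void
unpin_shared_catctup(SharedCatCTup *ct)
{
    /* entry becomes reclaimable once refcount is zero and obsolete is set */
    pg_atomic_fetch_sub_u32(&ct->refcount, 1);
}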
Takeshi-san, I am sorry for late response - I just waited new version of the patch from you for review. I read your last proposal and it seems to be very reasonable. From my point of view we can not reach acceptable level of performance if we do not have local cache at all. So, as you proposed, we should maintain local cache for uncommitted data. I think that size of global cache should be limited (you have introduced GUC for it). In principle it is possible to use dynamic shared memory and have unlimited global cache. But I do not see much sense in it. I do not completely understand from your description when are are going to evict entry from local cache? Just once transaction is committed? I think it will be more efficient to also specify memory threshold for local cache size and use LRU or some other eviction policy to remove data from local cache. So if working set (accessed relations) fits in local cache limit, there will be no performance penalty comparing with current implementation. There should be completely on difference on pgbench or other benchmarks with relatively small number of relations. If entry is not found in local cache, then we should look for it in global cache and in case of double cache miss - read it from the disk. I do not completely understand why we need to store references to global cache entries in local cache and use reference counters for global cache entries. Why we can not maintain just two independent caches? While there are really databases with hundreds and even thousands of tables, application is still used to work with only some small subset of them. So I think that "working set" can still fit in memory. This is why I think that in case of local cache miss and global cache hit, we should copy data from global cache to local cache to make it possible to access it in future without any sycnhronization. As far as we need to keep all uncommitted data in local cache, there is still a chance of local memory overflow (if some transaction creates or alters too much number of tables). But I think that it is very exotic and rare use case. The problem with memory overflow usually takes place if we have large number of backends, each maintaining its own catalog cache. So I think that we should have "soft" limit for local cache and "hard" limit for global cache. I didn't think much about cache invalidation. I read your proposal, but frankly speaking do not understand why it should be so complicated. Why we can't immediately invalidate entry in global cache and lazily (as it is done now using invalidation signals) invalidate local caches? On 26.06.2019 9:23, Ideriha, Takeshi wrote: > Hi, everyone. > >> From: Ideriha, Takeshi [mailto:ideriha.takeshi@jp.fujitsu.com] >> My current thoughts: >> - Each catcache has (maybe partial) HeapTupleHeader >> - put every catcache on shared memory and no local catcache >> - but catcache for aborted tuple is not put on shared memory >> - Hash table exists per kind of CatCache >> - These hash tables exists for each database and shared >> - e.g) there is a hash table for pg_class of a DB > I talked about shared CatCache (SysCache) with Thomas at PGCon and he > suggested using sinval to control cache visibility instead of xid. > Base on this I've changed my design. I'll send some PoC patch in a week > but share my idea beforehand. I'm sorry this email is too long to read > but I'm happy if you have some comments. > > Basically I won't make shared catcache as default, make it as option. 
> > Both local and shared memory has hash tables of catcache. A shared hash > entry is catctup itself and a local hash entry is a pointer to the > shared catctup. Actually, local hash entry does not hold a direct pointer > but points to a handle of shared catctup. The handle points to shared > catctup and is located in shared memory. This is intended to avoid > dangling pointer of local hash entry due to eviction of shared catctup > by LRU. ( The detail about LRU will be written in another email because > I'll implement it later.) > > * Search and Insert > Current postgres searches (local) hash table and if it's missed, search > the actual catalog (shared buffer and disk) and build the cache; build > the negative cache if not found. > > In new architecture, if cache is not found in local hash table, postgres > tries to search shared one before consulting shared buffer. Here is a > detail. To begin with, postgres looks up the pointer in local hash > table. If it's found, it references the pointer and gets catctup. If > not, it searches the shared hash table and gets catctup and insert > its pointer into local hash table if the catctup is found. If it doesn't > exist in shared hash table either, postgres searches actual catalog and > build the cache and in most cases insert it into shared hash table > and its pointer to local one. The exception case is that the cache > is made from uncommitted catalog tuple, which must not be seen from > other process. So an uncommitted cache is built in local memory and > pushed directly into local table but not shared one. Lastly, if there > is no tuple we're looking for, put negative tuple into shared hash table. > > * Invalidation and visibility control > Now let's talk about invalidation. Current cache invalidation is based > on local and shared invalidation queue (sinval). When transaction is > committed, sinval msg is queued into shared one. Other processes read and > process sinval msgs at their own timing. > > In shared catcache, I follow the current sinval in most parts. But I'll > change the action when sinval msg is queued up and read by a process. > When messages are added to shared queue, identify corresponding shared > caches (matched by hash value) and turn their "obsolete flag" on. When > sinval msg is read by a process, each process deletes the local hash > entries (pointer to handler). Each process can see a shared catctup as > long as its pointer (local entry) is valid. Because sinval msgs are not > processed yet, it's ok to keep seeing the pointer to possibly old > cache. After local entry is invalidated, its local process tries > to search shared hash table to always find a catctup whose obsolete flag > is off. The process can see the right shared cache after invalidation > messages are read because it checks the obsolete flag and also > uncommitted cache never exists in shared memory at all. > > There is a subtle thing here. Always finding a shared catctup without > obsolete mark assumes that the process already read the sinval msgs. So > before trying to search shared table, I make the process read sinval msg. > After it's read, local cache status becomes consistent with the action > to get a new cache. This reading timing is almost same as current postgres > behavior because it's happened after local cache miss both in current > design and mine. After cache miss in current design, a process opens > the relation and gets a heavyweight lock. At this time, in fact, it reads > the sinval msgs. 
(These things are well summarized in talking by Robert > Haas at PGCon[1]). > > Lastly, we need to invalidate a shared catctup itself at some point. But > we cannot delete is as long as someone sees it. So I'll introduce > refcounter. It's increased or decreased at the same timing when > current postgres manipulates the local refcounter of catctup and catclist > to avoid catctup is deleted while catclist is used or vice versa (that > is SearchCatCache/RelaseCatCache). So shared catctup is deleted when > its shared refcount becomes zero and obsolete flag is on. Once it's > vanished from shared cache, the obsolete cache never comes back again > because a process which tries to get cache but fails in shared hash table > already read the sinval messages (in any case it reads them when opening > a table and taking a lock). > > > I'll make a PoC aside from performance issue at first and use > SharedMemoryContext (ShmContext) [2], which I'm making to allocate/free > shared items via palloc/pfree. > > [1] https://www.pgcon.org/2019/schedule/attachments/548_Challenges%20of%20Concurrent%20DDL.pdf > [2] https://commitfest.postgresql.org/23/2166/ > > --- > Regards, > Takeshi Ideriha Hi -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
The last patch we got here (a prototype) was almost a year ago. There was substantial discussion about it, but no new version of the patch has been posted. Are we getting a proper patch soon, or did we give up on the approach entirely? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, Alvaro

>The last patch we got here (a prototype) was almost a year ago. There was
>substantial discussion about it, but no new version of the patch has been posted. Are
>we getting a proper patch soon, or did we give up on the approach entirely?

I'm sorry for the late response. I have started working on it again for the upcoming commitfest.

Regards,
Takeshi Ideriha
Hi, Konstantin

I'm very sorry for the late response, and thank you for your feedback.
(I re-sent this email because my email address changed and the first attempt couldn't be delivered to -hackers.)

>From: Konstantin Knizhnik [mailto:k.knizhnik@postgrespro.ru]
>
>Takeshi-san,
>
>I am sorry for late response - I just waited new version of the patch
>from you for review.

Though I haven't incorporated your idea yet, I made a PoC patch, which supports regular CREATE TABLE, SELECT, and DROP TABLE. TBH, the current patch is not very sophisticated: it fails some installcheck items with the global catalog cache on, and it is around 2k LOC.

>I read your last proposal and it seems to be very reasonable.
> From my point of view we can not reach acceptable level of performance
>if we do not have local cache at all.
>So, as you proposed, we should maintain local cache for uncommitted data.

Yeah, I did this in my patch.

>I think that size of global cache should be limited (you have introduced GUC for it).
>In principle it is possible to use dynamic shared memory and have
>unlimited global cache.
>But I do not see much sense in it.

Yes, I limit the size of the global cache. Right now it doesn't support an eviction policy like LRU.

>I do not completely understand from your description when are are going
>to evict entry from local cache?
>Just once transaction is committed? I think it will be more efficient
>to also specify memory threshold for local cache size and use LRU or
>some other eviction policy to remove data from local cache.
>So if working set (accessed relations) fits in local cache limit, there
>will be no performance penalty comparing with current implementation.
>There should be completely on difference on pgbench or other benchmarks
>with relatively small number of relations.
>
>If entry is not found in local cache, then we should look for it in
>global cache and in case of double cache miss - read it from the disk.
>I do not completely understand why we need to store references to
>global cache entries in local cache and use reference counters for global cache entries.
>Why we can not maintain just two independent caches?
>
>While there are really databases with hundreds and even thousands of
>tables, application is still used to work with only some small subset of them.
>So I think that "working set" can still fit in memory. This is why I
>think that in case of local cache miss and global cache hit, we should
>copy data from global cache to local cache to make it possible to access it in future without any synchronization.
>
>As far as we need to keep all uncommitted data in local cache, there is
>still a chance of local memory overflow (if some transaction creates or
>alters too much number of tables).
>But I think that it is very exotic and rare use case. The problem with
>memory overflow usually takes place if we have large number of
>backends, each maintaining its own catalog cache.
>So I think that we should have "soft" limit for local cache and "hard"
>limit for global cache.

Oh, I hadn't come up with this idea at all. So the local cache is sort of a first-level cache and the global cache is a second-level cache. That sounds great. It would be good for performance, and setting two GUC parameters for limiting the local cache and the global cache also gives the DBA complete memory control. Yeah, uncommitted data should be in the local cache, but that's the only exception. Not keeping track of references to the global cache from the local cache header seems less complex to implement. I'll look into the design.

>I didn't think much about cache invalidation.
>I read your proposal, but frankly speaking do not understand why it should be so complicated.
>Why we can't immediately invalidate entry in global cache and lazily
>(as it is done now using invalidation signals) invalidate local caches?

I was overthinking about when the local/global cache is evicted. Simply put, the process reads the sinval messages and then invalidates the entry.
If the refcount is not zero, the process marks the entry dead to prevent other processes from finding the obsolete cache in the global hash table.
The refcount of a global cache entry is raised between SearchSysCache() and ReleaseSysCache(), and freeing a global cache entry while its refcount is raised would cause invalid memory access.

Regards,
Takeshi Ideriha
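To make the refcount handling concrete, here is a minimal sketch of the idea (not code from the attached patch; the struct layout and the RemoveFromGlobalHash()/FreeGlobalEntry() helpers are placeholders invented for illustration, and the caller is assumed to hold the lock on the hash partition):

/* Hypothetical header of a shared catalog cache entry (illustration only). */
typedef struct GlobalCatCTup
{
    uint32      refcount;   /* raised between SearchSysCache() and ReleaseSysCache() */
    bool        dead;       /* set by invalidation while a reference is still held */
    /* ... cached tuple data follows ... */
} GlobalCatCTup;

/* Called when a sinval message arrives for this entry. */
static void
InvalidateGlobalEntry(GlobalCatCTup *ct)
{
    RemoveFromGlobalHash(ct);       /* placeholder: unlink so no new lookup finds it */
    if (ct->refcount == 0)
        FreeGlobalEntry(ct);        /* placeholder: nobody uses it, reclaim at once */
    else
        ct->dead = true;            /* defer reclaim until the last user releases it */
}

/* Shared-cache counterpart of ReleaseSysCache(). */
static void
ReleaseGlobalEntry(GlobalCatCTup *ct)
{
    if (--ct->refcount == 0 && ct->dead)
        FreeGlobalEntry(ct);
}

This is just one way to avoid freeing an entry while its refcount is raised; the actual patch may arrange the bookkeeping differently.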
Attachment
Hi, Konstantin

>>From: Konstantin Knizhnik [mailto:k.knizhnik@postgrespro.ru]
>>So I think that we should have "soft" limit for local cache and "hard"
>>limit for global cache.
>
>Oh, this idea hadn't occurred to me at all. So the local cache is a sort of first-level cache
>and the global cache is the second level. That sounds great.
>It would be good for performance, and having two GUC parameters to limit the local cache
>and the global cache also gives the DBA complete memory control.
>Yeah, uncommitted data should live in the local cache, but that's the only exception.
>Not having to track references to the global cache from the local cache header seems less
>complex to implement. I'll look into the design.

(After sleeping on it)
What happens if there is a cache miss in local memory and the entry is found in the global cache?
One possible way is to copy the entry found in the global cache into local memory. If so, I'm just worried about the cost of the memcpy.
Another way is to leave the entry in the global cache and not copy it into local memory. In that case, searching the global cache every time seems expensive, because we need to take a lock on at least the partition of the hash table.

The architecture in which the local cache holds a reference to the global cache (strictly speaking, a pointer to a pointer to the global cache entry) is more complex, but once a process has searched the global cache it can reach the entry again just by checking that the reference is still valid and following a couple of pointers.

Regards,
Takeshi Ideriha
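For illustration, a rough sketch of the "pointer to pointer" reference idea (all type, field, and function names here are invented, and synchronization around the handle is omitted):

typedef struct GlobalCatCTup GlobalCatCTup;

/* The handle lives in shared memory; invalidation clears it. */
typedef struct GlobalCatCTupHandle
{
    GlobalCatCTup *entry;           /* set to NULL when the global entry is invalidated */
} GlobalCatCTupHandle;

/* Local cache header pointing into the shared cache. */
typedef struct LocalCatCTup
{
    GlobalCatCTupHandle *handle;    /* reference into the global cache */
    /* ... local header data ... */
} LocalCatCTup;

/* Fast path: reuse the previously found global entry without searching
 * (and locking) the shared hash table again. */
static GlobalCatCTup *
LookupViaHandle(LocalCatCTup *lct)
{
    GlobalCatCTup *gct = lct->handle->entry;

    if (gct == NULL)
        return NULL;                /* invalidated meanwhile: do a full search instead */

    PinGlobalEntry(gct);            /* placeholder: bump the refcount as in SearchSysCache() */
    return gct;
}

The trade-off is exactly the one described above: the indirection avoids both the memcpy of a full copy and the partition lock of a repeated hash search, at the cost of a more complex structure.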
On 09.10.2019 9:06, ideriha.takeshi@fujitsu.com wrote:
> (After sleeping on it)
> What happens if there is a cache miss in local memory and the entry is found in the global cache?
> One possible way is to copy the entry found in the global cache into local memory. If so,
> I'm just worried about the cost of the memcpy. Another way is to leave the entry in the
> global cache and not copy it into local memory. In that case, searching the global cache
> every time seems expensive, because we need to take a lock on at least the partition of
> the hash table.
>
> The architecture in which the local cache holds a reference to the global cache
> (strictly speaking, a pointer to a pointer to the global cache entry) is more complex,
> but once a process has searched the global cache it can reach the entry again just by
> checking that the reference is still valid and following a couple of pointers.
If the assumption that the working set of a backend (the set of tables accessed by the session) is small enough to fit in the backend's memory is true, then a global meta cache is not needed at all: it is enough to limit the size of the local cache and implement some eviction algorithm.
If data is not found in the local cache, it is loaded from the catalog in the standard way.
It is the simplest solution, and maybe it is a good starting point for work in this direction.

If there are cases when an application needs to work with hundreds of tables (partitioning?), then we can either store references to the global cache in the local cache or perform two lookups: in the local and global caches.
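A minimal sketch of that simpler alternative, assuming a purely local cache with an upper bound and LRU eviction (catcache_local_limit is an imaginary GUC in kilobytes, and the list/field/function names are placeholders, not code from any posted patch):

/* Requires lib/ilist.h for the dlist_* helpers. */
static void
EnforceLocalCacheLimit(void)
{
    while (local_cache_bytes > (Size) catcache_local_limit * 1024)
    {
        /* the LRU list is kept in access order; the tail is the coldest entry */
        LocalCatCTup *victim = dlist_tail_element(LocalCatCTup, lru_node,
                                                  &local_lru_list);

        /* keep entries created by the current (uncommitted) transaction;
         * for simplicity this sketch just stops evicting in that case */
        if (victim->uncommitted)
            break;

        local_cache_bytes -= victim->size;
        dlist_delete(&victim->lru_node);
        RemoveFromLocalHash(victim);            /* placeholder */
        pfree(victim);
    }
}

If the data is not in the local cache it is simply rebuilt from the catalog, so no shared structure (and no locking) is involved at all.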
>From: Konstantin Knizhnik [mailto:k.knizhnik@postgrespro.ru]
>If the assumption that the working set of a backend (the set of tables accessed by the session)
>is small enough to fit in the backend's memory is true, then a global meta cache is not
>needed at all: it is enough to limit the size of the local cache and implement some eviction
>algorithm.
>If data is not found in the local cache, it is loaded from the catalog in the standard way.
>It is the simplest solution, and maybe it is a good starting point for work in this direction.

Thank you for the reply. I introduced a GUC so that users can choose whether they want to use this feature or not.
But as you said, if the data size is not that big, my suggestion does too much and a simple threshold is enough.
The idea of a threshold has been discussed in another thread, so I'd like to discuss it there.
Though that thread is not active these days, the ideas discussed so far are a memory limit, an access-time limit, and a hybrid of the two. It seems to me the discussion converged on eviction by access timestamp.
https://www.postgresql.org/message-id/flat/20161219.201505.11562604.horiguchi.kyotaro@lab.ntt.co.jp

>If there are cases when an application needs to work with hundreds of tables
>(partitioning?), then we can either store references to the global cache in the local cache or
>perform two lookups: in the local and global caches.

I think this is my target. Especially with many partitioned tables, many (more than 100) columns, and many backends, a shared cache would bring more benefit for memory usage and performance than a simple threshold alone.

I did an experiment before. Each parent table has about 145 integer columns and is partitioned into about 350 child tables on average. There are 11 parent tables and about 3,850 tables in total.
When I ran "SELECT * FROM parent_table" against the 11 parent tables, CacheMemoryContext alone consumed about 0.37 GB per backend, so with 100 backends it would consume about 37 GB.
This is because the number of system catalog cache entries for pg_statistic is very large (about 577,000), which is almost the number of columns (145) times the number of tables (3,850).
(Sorry that the model and figures are not simple to understand; the arithmetic is spelled out below.)

By the way, my current patch contains some redundant code. For example, LWLocks are used in places where spinlocks would be enough.
Another thing is that the reference count of the local reference is incremented and decremented even though the local reference cache doesn't need to be protected.
I'll fix these things and submit statistics about memory usage and performance.

Regards,
Takeshi Ideriha
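Spelling out the figures above as rough arithmetic (using the numbers reported in the experiment, rounded):

    145 columns x 3,850 tables                              =  558,250 pg_statistic cache entries (~577,000 observed)
    ~0.37 GB of CacheMemoryContext per backend x 100 backends  =  ~37 GB in total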
On Wed, Nov 06, 2019 at 02:55:30AM +0000, ideriha.takeshi@fujitsu.com wrote:
> Thank you for the reply.

Latest patch does not apply. Please send a rebase. Patch moved to next CF, waiting on author. Bip.
--
Michael
Attachment
This patch was broken and waiting for author since early December, so I've marked it as returned with feedback. Feel free to resubmit an updated version to a future commitfest.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

Thank you for handling it as CFM, and sorry for keeping you waiting so long.

>-----Original Message-----
>From: Tomas Vondra [mailto:tomas.vondra@2ndquadrant.com]
>
>This patch was broken and waiting for author since early December, so I've marked it
>as returned with feedback. Feel free to resubmit an updated version to a future
>commitfest.

I couldn't make it by last January, but I'm now resubmitting this patch. The previous version didn't pass the regression tests, but this one does.

Transaction consistency is handled by invalidation messages and by where a cache entry lives.
As I mentioned before, for committed catalog tuples the search order is local memory -> shared memory -> catalog (either shared buffers or disk).
The local cache header has a pointer to the global cache entry, and we can use it.
But after a system catalog is updated within a transaction, we need to build the cache for the uncommitted catalog tuple from the catalog itself, based on MVCC.
Searching the global cache should be skipped in that case, because the global cache entry may be stale for that transaction.
So at the end of the command the local cache is invalidated as usual, and in addition the process creates a dummy cache entry.
This entry indicates that if the only cache found is the dummy one, we need to skip the global search and read the actual catalog.
The cache built for uncommitted tuples is kept in local memory so that other processes cannot see it.

- shared_memory_cotnext-v01.patch
  Memory context for the global system catalog cache and relation cache, backed by DSA
- global_catalog_cache-v01.patch
  For the global system catalog cache

Things not included in this patch:
- logic to sweep the shared area
- global relation cache

Hope you have some comments!

Regards,
Takeshi Ideriha
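A rough sketch of the lookup order described above (the function and field names are placeholders for illustration, not the functions in the attached patches):

/* Lookup order: local cache -> shared (global) cache -> system catalog. */
static HeapTuple
SearchCatCacheShared(CatCache *cache, Datum key)
{
    LocalCatCTup *lct = LocalCacheLookup(cache, key);
    GlobalCatCTup *gct;

    if (lct != NULL)
    {
        /*
         * The dummy entry is created after this transaction modified the
         * catalog: it means the global cache may be stale for us, so we go
         * straight to the catalog and build a local-only entry (MVCC snapshot).
         */
        if (lct->is_uncommitted_dummy)
            return BuildLocalEntryFromCatalog(cache, key);
        return lct->tuple;                      /* local hit */
    }

    /* local miss: try the shared cache, taking the partition lock inside */
    gct = GlobalCacheLookup(cache, key);
    if (gct != NULL)
        return CopyIntoLocalCache(cache, gct);  /* global hit */

    /* double miss: read the catalog (shared buffers or disk) as usual */
    return BuildLocalEntryFromCatalog(cache, key);
}

Since the dummy entry and the cache built from uncommitted tuples live only in local memory, other sessions are unaffected until the transaction commits and the usual invalidation takes place.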