Thread: Cache relation sizes?
Hello,

PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a lot. For example, a fully prewarmed pgbench -S transaction consists of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think lseek() is probably about as cheap as a syscall can be so I doubt it really costs us much, but it's still a context switch and it stands out when tracing syscalls, especially now that all the lseek(SEEK_SET) calls are gone (commit c24dcd0cfd).

If we had a different kind of buffer mapping system of the kind that Andres has described, there might be a place in shared memory that could track the size of the relation. Even if we could do that, I wonder if it would still be better to do a kind of per-backend lock-free caching, like this:

1. Whenever a file size has been changed by extending or truncating (ie immediately after the syscall), bump a shared "size_change" invalidation counter.

2. Somewhere in SMgrRelation, store the last known size_change counter and the last known size. In _mdnblocks(), if the counter hasn't moved, we can use the cached size and skip the call to FileSize().

3. To minimise false cache invalidations (caused by other relations' size changes), instead of using a single size_change counter in shared memory, use an array of N and map relation OIDs onto them.

4. As for memory coherency, I think it might be enough to use uint32 without locks or read barriers on the read side, since you have a view of memory at least as new as your snapshot (the taking of which included a memory barrier). That's good enough because we don't need to see blocks added after our snapshot was taken (the same assumption applies today, this just takes further advantage of it), and truncations can't happen while we have a share lock on the relation (the taking of which also includes a memory barrier, covering the case where the truncation happened after our snapshot and the acquisition of the share lock on the relation). In other words, there is heavy locking around truncation already, and for extension we don't care about recent extensions, so we can be quite relaxed about memory. Right?

I don't have a patch for this (though I did once try it as a throw-away hack and it seemed to work), but I just wanted to share the idea and see if anyone sees a problem with the logic/interlocking, or has a better idea for how to do this. It occurred to me that I might be missing something or this would have been done already...

--
Thomas Munro
http://www.enterprisedb.com
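For illustration only, here is a rough, untested C sketch of points 1-3; every name in it (NUM_SIZE_COUNTERS, RelSizeCounters, bump_size_counter(), and the cached_* fields imagined inside SMgrRelationData) is invented for this example rather than taken from any real patch, and error handling is elided:

/* Point 3: a small, fixed array of shared invalidation counters. */
#define NUM_SIZE_COUNTERS 64            /* arbitrary for this sketch */

typedef struct RelSizeCounters
{
    pg_atomic_uint32 counters[NUM_SIZE_COUNTERS];
} RelSizeCounters;

static RelSizeCounters *SizeCounters;   /* lives in shared memory */

static inline pg_atomic_uint32 *
size_counter_for(Oid relNode)
{
    return &SizeCounters->counters[relNode % NUM_SIZE_COUNTERS];
}

/* Point 1: call this immediately after extending or truncating the file. */
static void
bump_size_counter(SMgrRelation reln)
{
    pg_atomic_fetch_add_u32(size_counter_for(reln->smgr_rnode.node.relNode), 1);
}

/* Point 2: _mdnblocks() consults the per-backend cache before FileSize(). */
static BlockNumber
cached_nblocks(SMgrRelation reln, MdfdVec *seg)
{
    uint32      counter;

    counter = pg_atomic_read_u32(size_counter_for(reln->smgr_rnode.node.relNode));
    if (reln->cached_nblocks_valid && reln->cached_counter == counter)
        return reln->cached_nblocks;            /* no lseek(SEEK_END) */

    reln->cached_nblocks = (BlockNumber) (FileSize(seg->mdfd_vfd) / BLCKSZ);
    reln->cached_counter = counter;
    reln->cached_nblocks_valid = true;
    return reln->cached_nblocks;
}

Mapping relation OIDs onto a fixed number of counters means unrelated relations will occasionally invalidate each other, but the only cost of a false invalidation is one extra lseek(SEEK_END), so the array would not need to be large.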
Hi, On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: > PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > lot. For example, a fully prewarmed pgbench -S transaction consists > of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > lseek() is probably about as cheap as a syscall can be so I doubt it > really costs us much, but it's still a context switch and it stands > out when tracing syscalls, especially now that all the lseek(SEEK_SET) > calls are gone (commit c24dcd0cfd). I'd really really like to see some benchmarking before embarking on a more complex scheme. I aesthetically dislike those lseeks, but ... > If we had a different kind of buffer mapping system of the kind that > Andres has described, there might be a place in shared memory that > could track the size of the relation. Even if we could do that, I > wonder if it would still be better to do a kind of per-backend > lock-free caching, like this: Note that the reason for introducing that isn't primarily motivated by getting rid of the size determining lseeks, but reducing the locking for extending *and* truncating relations. Greetings, Andres Freund
On Wed, 7 Nov 2018 at 11:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > Hello, > > PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > lot. For example, a fully prewarmed pgbench -S transaction consists > of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > lseek() is probably about as cheap as a syscall can be so I doubt it > really costs us much, but it's still a context switch and it stands > out when tracing syscalls, especially now that all the lseek(SEEK_SET) > calls are gone (commit c24dcd0cfd). > > If we had a different kind of buffer mapping system of the kind that > Andres has described, there might be a place in shared memory that > could track the size of the relation. Even if we could do that, I > wonder if it would still be better to do a kind of per-backend > lock-free caching, like this: On behalf of those looking after databases running over NFS (sigh!), I think this is definitely worth exploring. Looking at the behaviour of my (9.4.9) server, there's an lseek(SEEK_END) for every relation (table or index) used by a query, which is a lot of them for a heavily partitioned database. The lseek counts seem to be the same with native partitions and 10.4. As an incredibly rough benchmark, a "SELECT * FROM t ORDER BY pk LIMIT 0" on a table with 600 partitions, which builds a MergeAppend/IndexScan plan, invokes lseek around 1200 times, and takes 600ms to return when repeated. (It's much slower the first time, because the backend has to open the files, and read index blocks. I found that increasing max_files_per_process above the number of tables/indexes in the query made a huge difference, too!) Testing separately, 1200 lseeks on that NFS mount takes around 400ms. I'm aware of other improvements since 9.4.9 that would likely improve things (the pread/pwrite change; possibly using native partitioning instead of inheritance), but I imagine reducing lseeks would too. (Of course, an even better improvement is to not put your data directory on an NFS mount (sigh).)
On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: >> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a >> lot. For example, a fully prewarmed pgbench -S transaction consists >> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think >> lseek() is probably about as cheap as a syscall can be so I doubt it >> really costs us much, but it's still a context switch and it stands >> out when tracing syscalls, especially now that all the lseek(SEEK_SET) >> calls are gone (commit c24dcd0cfd). > > I'd really really like to see some benchmarking before embarking on a > more complex scheme. I aesthetically dislike those lseeks, but ... I agree. It would be good to see benchmarks on this first. Those could be as simple as just some crude local backend cache that stuff the return value of RelationGetNumberOfBlocks in estimate_rel_size into a hashtable and does not take into account the fact that it might change. Should be okay to do some read-only benchmarking. The partitioning case is probably a less interesting case to improve if we get [1] as we'll no longer ask for the size of any pruned partitions. Queries that don't prune any partitions are less likely to notice the extra overhead of the lseek(SEEK_END) since they've probably got more work to do elsewhere. [1] https://commitfest.postgresql.org/20/1778/ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
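For what it's worth, the crude benchmarking cache being described could be as small as the following hypothetical sketch (a dynahash keyed by relation OID, with no invalidation whatsoever, so it is only meaningful for read-only tests exactly as noted above); estimate_rel_size() would call this wrapper instead of RelationGetNumberOfBlocks() directly:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

typedef struct RelSizeCacheEntry
{
    Oid         relid;          /* hash key; must be first */
    BlockNumber nblocks;        /* memoized RelationGetNumberOfBlocks() result */
} RelSizeCacheEntry;

static HTAB *rel_size_cache = NULL;

static BlockNumber
crude_cached_nblocks(Relation rel)
{
    Oid         relid = RelationGetRelid(rel);
    RelSizeCacheEntry *entry;
    bool        found;

    if (rel_size_cache == NULL)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(RelSizeCacheEntry);
        rel_size_cache = hash_create("crude rel size cache", 1024, &ctl,
                                     HASH_ELEM | HASH_BLOBS);
    }

    entry = (RelSizeCacheEntry *) hash_search(rel_size_cache, &relid,
                                              HASH_ENTER, &found);
    if (!found)
        entry->nblocks = RelationGetNumberOfBlocks(rel);    /* the lseek */

    return entry->nblocks;
}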
On Fri, Nov 9, 2018 at 4:42 PM David Rowley <david.rowley@2ndquadrant.com> wrote: > On 7 November 2018 at 11:46, Andres Freund <andres@anarazel.de> wrote: > > On 2018-11-07 11:40:22 +1300, Thomas Munro wrote: > >> PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a > >> lot. For example, a fully prewarmed pgbench -S transaction consists > >> of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto(). I think > >> lseek() is probably about as cheap as a syscall can be so I doubt it > >> really costs us much, but it's still a context switch and it stands > >> out when tracing syscalls, especially now that all the lseek(SEEK_SET) > >> calls are gone (commit c24dcd0cfd). > > > > I'd really really like to see some benchmarking before embarking on a > > more complex scheme. I aesthetically dislike those lseeks, but ... > > I agree. It would be good to see benchmarks on this first. Those > could be as simple as just some crude local backend cache that stuff > the return value of RelationGetNumberOfBlocks in estimate_rel_size > into a hashtable and does not take into account the fact that it might > change. Should be okay to do some read-only benchmarking. Oh, I just found the throw-away patch I wrote ages ago down the back of the sofa. Here's a rebase. It somehow breaks initdb so you have to initdb with unpatched. Unfortunately I couldn't seem to measure any speed-up on a random EDB test lab Linux box using pgbench -S (not "prepared"), but that test doesn't involve many tables, and also it's an older kernel without KPTI mitigations. Attached in case anyone else would like to try it. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Fri, 16 Nov 2018 at 12:06, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Oh, I just found the throw-away patch I wrote ages ago down the back > of the sofa. Here's a rebase. It somehow breaks initdb so you have > to initdb with unpatched. Unfortunately I couldn't seem to measure > any speed-up on a random EDB test lab Linux box using pgbench -S (not > "prepared"), but that test doesn't involve many tables, and also it's > an older kernel without KPTI mitigations. Attached in case anyone > else would like to try it. Over on [1] there's some talk about how when using PREPAREd statements on a table with many partitions where some of the parameters help prune many of the partitions, that on the 6th, and only on the 6th execution of the statement that there's a huge spike in the query latency. This will be down to the fact that GetCachedPlan() builds a generic plan on the 6th execution and most likely discards it due to it appearing too expensive because of lack of any partition pruning. The custom plan's cost is likely much much cheaper, so the generic plan is planned but never used. This may be a good real-life candidate to test this patch with. I know from benchmarks I performed several months ago that the lseek() call to determine the relation size was a large part of the cost of planning with many partitions. [1] https://www.postgresql.org/message-id/flat/25C1C6B2E7BE044889E4FE8643A58BA963D89214%40G01JPEXMBKW03 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hello,

I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate large tables.
As mentioned by David, currently there is a query latency spike when we make a generic plan for a partitioned table with many partitions. I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time, I also monitored the lseek calls using simple strace, with and without the patch.

Below are the test results.
Setup: 8192 table partitions.

(1) set plan_cache_mode = 'force_generic_plan';

[Without Patch]
prepare select_stmt(int) as select * from t where id = $1;
explain (timing off, analyze) execute select_stmt(8192);
[…]
 Planning Time: 1678.680 ms
 Execution Time: 643.495 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.017247           1     16385           lseek

[With Patch]
[…]
 Planning Time: 1596.566 ms
 Execution Time: 653.646 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.009196           1      8192           lseek

It was mentioned in the other thread [1] that creating a generic plan for the first time is very expensive. Although this patch did not seem to reduce the cost of planning time for force_generic_plan, it seems that the number of lseek calls was reduced by half during the first execution of the generic plan.

(2) plan_cache_mode = 'auto'
reset plan_cache_mode;  -- resets to auto / custom plan

[Without Patch]
[…]
 Planning Time: 768.669 ms
 Execution Time: 0.776 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
 100.00    0.015117           2      8193           lseek

[With Patch]
[…]
 Planning Time: 181.690 ms
 Execution Time: 0.615 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.

Without the patch, there were around 8193 lseek calls. With the patch applied, there were no lseek calls when creating the custom plan.

(3) set plan_cache_mode = 'force_generic_plan';
-- force it to generic plan again to use the cached plan (no re-planning)

[Without Patch]
[…]
 Planning Time: 14.294 ms
 Execution Time: 601.141 ms

$ strace -p [pid] -e trace=lseek -c
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
   0.00    0.000000           0         1           lseek

[With Patch]
[…]
 Planning Time: 13.976 ms
 Execution Time: 570.518 ms

$ strace -p [pid] -e trace=lseek -c
[…]
NO (zero) lseek calls.

----

If I did the test correctly, I am not sure why the patch did not affect the generic planning performance of a table with many partitions. However, the number of lseek calls was greatly reduced with Thomas' patch.
I also did not get a considerable speed-up in terms of latency average using pgbench -S (read-only, unprepared).

I am assuming this might be applicable to other use cases as well. (I just tested the patch, but haven't dug into the patch details yet.)
Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement?

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

Regards,
Kirk Jamison
On Thu, Dec 27, 2018 at 8:00 PM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote: > I also find this proposed feature to be beneficial for performance, especially when we want to extend or truncate largetables. > As mentioned by David, currently there is a query latency spike when we make generic plan for partitioned table with manypartitions. > I tried to apply Thomas' patch for that use case. Aside from measuring the planning and execution time, > I also monitored the lseek calls using simple strace, with and without the patch. Thanks for looking into this and testing! > Setup 8192 table partitions. > (1) set plan_cache_mode = 'force_generic_plan'; > Planning Time: 1678.680 ms > Planning Time: 1596.566 ms > (2) plan_cache_mode = 'auto’ > Planning Time: 768.669 ms > Planning Time: 181.690 ms > (3) set plan_cache_mode = 'force_generic_plan'; > Planning Time: 14.294 ms > Planning Time: 13.976 ms > If I did the test correctly, I am not sure though as to why the patch did not affect the generic planning performance oftable with many partitions. > However, the number of lseek calls was greatly reduced with Thomas’ patch. > I also did not get considerable speed up in terms of latency average using pgbench –S (read-only, unprepared). > I am assuming this might be applicable to other use cases as well. > (I just tested the patch, but haven’t dug up the patch details yet). The result for (2) is nice. Even though you had to use 8192 partitions to see it. > Would you like to submit this to the commitfest to get more reviews for possible idea/patch improvement? For now I think this still in the experiment/hack phase and I have a ton of other stuff percolating in this commitfest already (and a week of family holiday in the middle of January). But if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal yourself! Unfortunately I wouldn't have time to nurture it this time around, beyond some drive-by comments. Assorted armchair speculation: I wonder how much this is affected by the OS and KPTI, virtualisation technology, PCID support, etc. Back in the good old days, Linux's lseek(SEEK_END) stopped acquiring the inode mutex when reading the size, at least in the generic implementation used by most filesystems (I wonder if our workloads were indirectly responsible for that optimisation?) so maybe it became about as fast as a syscall could possibly be, but now the baseline for how fast syscalls can be has moved and it also depends on your hardware, and it also has external costs that depend on what memory you touch in between syscalls. Also, other operating systems might still acquire a per-underlying-file/vnode/whatever lock (<checks source code>... yes) and the contention for that might depend on what else is happening, so that a single standalone test wouldn't capture that but a super busy DB with a rapidly expanding and contracting table that many other sessions are trying to observe with lseek(SEEK_END) could slow down more. -- Thomas Munro http://www.enterprisedb.com
Hi Thomas,

On Friday, December 28, 2018 6:43 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> [...] if you have ideas about the validity of the assumptions, the reason it breaks initdb, or any other aspect of this approach (or alternatives), please don't let me stop you, and of course please feel free to submit this, an improved version or an alternative proposal [...]

Sure. Thanks. I'd like to try to work on the idea. I also took a look at the code, and I hope you don't mind if I ask for clarifications (explanation/advice/opinions) on the following, since my postgres experience is not substantial enough yet.

(1) I noticed that you used a "relsize_change_counter" stored in MdSharedData. Is my understanding below correct?

The intention is to cache the rel size per-backend (lock-free), with an array of relsize_change_counter used to skip the lseek syscall when the counter does not change. In _mdnblocks(), if the counter did not change, the cached rel size (nblocks) is used and the call to FileSize() (the lseek that gets and caches the rel size) is skipped. That means that in default Postgres master, the lseek syscall (through FileSize()) is called whenever we want to get the rel size (nblocks).

On the other hand, the simplest method I thought could also work is to only cache the file size (nblock) in shared memory, not in the backend process, since both nblock and relsize_change_counter are uint32 data types anyway. If relsize_change_counter can be changed without a lock, then nblock can be changed without a lock, is that right? In that case, nblock can be accessed directly in shared memory. In this case, is it necessary to cache the relation size in the backend?

(2) Is the MdSharedData temporary or permanent in shared memory?

From the patch:

typedef struct MdSharedData
{
    /* XXX could have an array of these, and use rel OID % nelements? */
    pg_atomic_uint32 relsize_change_counter;
} MdSharedData;

static MdSharedData *MdShared;

What I intend to have is a permanent hashtable that will keep the file size (eventually/in future development, including table addresses) in shared memory for faster access by backend processes. The idea is to keep track of the unallocated blocks, based on how much the relation has been extended or truncated. Memory for this hashtable will be dynamically allocated.

Thanks,
Kirk Jamison
>From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com] >On the other hand, the simplest method I thought that could also work is to only cache >the file size (nblock) in shared memory, not in the backend process, since both nblock >and relsize_change_counter are uint32 data type anyway. If relsize_change_counter >can be changed without lock, then nblock can be changed without lock, is it right? In >that case, nblock can be accessed directly in shared memory. In this case, is the >relation size necessary to be cached in backend? (Aside from which idea is better.. ) If you want to put relation size on the shared memory, then I don't think caches in backend is necessary because every time relation_size is updated you need to invalidate cache in backends. At the reference taking shared lock on the cache and at the update taking exclusive lock is simple without backend cache. >(2) Is the MdSharedData temporary or permanent in shared memory? That data structure seems to initialize at the time of InitPostgre, which means it's permanent because postgres-initialized-shared-memory doesn't have a chance to drop it as far as I know. (If you want to use temporary data structure, then other mechanism like dsm/dsa/dshash is a candidate.) Regards, Takeshi Ideriha
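To illustrate the locking protocol being described (shared lock to read the size, exclusive lock only when extending or truncating, with no per-backend cache), a purely hypothetical shared-memory entry could look like this:

/* One hypothetical entry per tracked relation, living in shared memory. */
typedef struct SharedRelSizeEntry
{
    LWLock      lock;
    RelFileNode rnode;
    BlockNumber nblocks;
} SharedRelSizeEntry;

/* Readers take the lock in shared mode. */
static BlockNumber
shared_nblocks_get(SharedRelSizeEntry *entry)
{
    BlockNumber result;

    LWLockAcquire(&entry->lock, LW_SHARED);
    result = entry->nblocks;
    LWLockRelease(&entry->lock);
    return result;
}

/* smgrextend()/smgrtruncate() would update it in exclusive mode. */
static void
shared_nblocks_set(SharedRelSizeEntry *entry, BlockNumber nblocks)
{
    LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
    entry->nblocks = nblocks;
    LWLockRelease(&entry->lock);
}

Whether even the shared-mode acquisition is cheap enough on hot paths is the kind of question the benchmarking discussed earlier would need to answer.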
From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com]
> On the other hand, the simplest method I thought that could also work is
> to only cache the file size (nblock) in shared memory, not in the backend
> process, since both nblock and relsize_change_counter are uint32 data type
> anyway. If relsize_change_counter can be changed without lock, then nblock
> can be changed without lock, is it right? In that case, nblock can be accessed
> directly in shared memory. In this case, is the relation size necessary
> to be cached in backend?

Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.

> (2) Is the MdSharedData temporary or permanent in shared memory?
> from the patch:
> typedef struct MdSharedData
> {
>     /* XXX could have an array of these, and use rel OID % nelements? */
>     pg_atomic_uint32 relsize_change_counter;
> } MdSharedData;
>
> static MdSharedData *MdShared;

Permanent in shared memory.


Regards
Takayuki Tsunakawa
At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1FB955DF@G01JPEXMBYT05> > From: Jamison, Kirk [mailto:k.jamison@jp.fujitsu.com] > > On the other hand, the simplest method I thought that could also work is > > to only cache the file size (nblock) in shared memory, not in the backend > > process, since both nblock and relsize_change_counter are uint32 data type > > anyway. If relsize_change_counter can be changed without lock, then nblock > > can be changed without lock, is it right? In that case, nblock can be accessed > > directly in shared memory. In this case, is the relation size necessary > > to be cached in backend? > > Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in sharedmemory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relationsize in the RelationData structure in relcache. Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench. > > (2) Is the MdSharedData temporary or permanent in shared memory? > > from the patch: > > typedef struct MdSharedData > > { > > /* XXX could have an array of these, and use rel OID % nelements? > > */ > > pg_atomic_uint32 relsize_change_counter; > > } MdSharedData; > > > > static MdSharedData *MdShared; > > Permanent in shared memory. I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp] > Just one counter in the patch *seems* to give significant gain > comparing to the complexity, given that lseek is so complex or it > brings latency, especially on workloads where file is scarcely > changed. Though I didn't run it on a test bench. I expect so, too. > I'm not sure the duration of the 'permanent' there, but it > disappears when server stops. Anyway it doesn't need to be > permanent beyond a server restart. Right, it exists while the server is running. Regards Takayuki Tsunakawa
On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote: >At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote: >> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in sharedmemory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relationsize in the RelationData structure in relcache. >Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complexor it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench. > > > (2) Is the MdSharedData temporary or permanent in shared memory? > > Permanent in shared memory. > I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanentbeyond a server restart. Thank you for the insights. I did a simple test in the previous email using simple syscall tracing, the patch significantly reduced the number of lseek syscall. (but that simple test might not be enough to describe the performance benefit) Regarding Tsunakawa-san's comment, in Thomas' patch, he made a place in shared memory that stores the relsize_change_counter, so I am thinking of utilizing the same, but for caching the relsize itself. Perhaps I should explain further the intention for the design. First step, to cache the file size in the shared memory. Similar to the intention or purpose of the patch written by Thomas, to reduce the number of lseek(SEEK_END) by caching the relation size without using additional locks. The difference is by caching a rel size on the shared memory itself. I wonder if there are problems that you can see with this approach. Eventually, the next step is to have a structure in shared memory that caches file addresses along with their sizes (Andres' idea of putting an open relations table in the shared memory). With a structure that group buffers into file relation units, we can get file size directly from shared memory, so the assumption is it would be faster whenever we truncate/extend our relations because we can track the offset of the changes in size and use range for managing the truncation, etc.. The next step is a complex direction that needs serious discussion, but I am wondering if we can proceed with the first step for now if the idea and direction are valid. Regards, Kirk Jamison
On 2019-02-06 08:50:45 +0000, Jamison, Kirk wrote:
> On February 6, 2019, 08:25AM +0000, Kyotaro HORIGUCHI wrote:
> > At Wed, 6 Feb 2019 06:29:15 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote:
> >> Although I haven't looked deeply at Thomas's patch yet, there's currently no place to store the size per relation in shared memory. You have to wait for the global metacache that Ideriha-san is addressing. Then, you can store the relation size in the RelationData structure in relcache.
> >
> > Just one counter in the patch *seems* to give significant gain comparing to the complexity, given that lseek is so complex or it brings latency, especially on workloads where file is scarcely changed. Though I didn't run it on a test bench.
> >
> > > (2) Is the MdSharedData temporary or permanent in shared memory?
> > > Permanent in shared memory.
> > I'm not sure the duration of the 'permanent' there, but it disappears when server stops. Anyway it doesn't need to be permanent beyond a server restart.
>
> Thank you for the insights.
> I did a simple test in the previous email using simple syscall tracing,
> the patch significantly reduced the number of lseek syscalls.
> (but that simple test might not be enough to describe the performance benefit)
>
> Regarding Tsunakawa-san's comment,
> in Thomas' patch, he made a place in shared memory that stores the
> relsize_change_counter, so I am thinking of utilizing the same,
> but for caching the relsize itself.
>
> Perhaps I should explain further the intention for the design.
>
> First step, to cache the file size in the shared memory. Similar to the
> intention or purpose of the patch written by Thomas, to reduce the
> number of lseek(SEEK_END) by caching the relation size without using
> additional locks. The difference is by caching a rel size on the shared
> memory itself. I wonder if there are problems that you can see with
> this approach.
>
> Eventually, the next step is to have a structure in shared memory
> that caches file addresses along with their sizes (Andres' idea of
> putting an open relations table in the shared memory). With a
> structure that group buffers into file relation units, we can get
> file size directly from shared memory, so the assumption is it would
> be faster whenever we truncate/extend our relations because we can
> track the offset of the changes in size and use range for managing
> the truncation, etc.
> The next step is a complex direction that needs serious discussion,
> but I am wondering if we can proceed with the first step for now if
> the idea and direction are valid.

Maybe I'm missing something here, but why is it actually necessary to have the sizes in shared memory, if we're just talking about caching sizes? It's pretty darn cheap to determine the filesize of a file that has been recently stat()/lseek()/ed, and introducing per-file shared data adds *substantial* complexity, because the amount of shared memory needs to be pre-determined.

The reason I want to put per-relation data into shared memory is different, it's about putting the buffer mapping into shared memory, and that, as a prerequisite, also needs per-relation data. And there's a limit on the number of relations that need to be open (one per cached page at max), and space can be freed by evicting pages.

Greetings,

Andres Freund
On February 6, 2019, 8:57 AM +0000, Andres Freund wrote: > Maybe I'm missing something here, but why is it actually necessary to > have the sizes in shared memory, if we're just talking about caching > sizes? It's pretty darn cheap to determine the filesize of a file that > has been recently stat()/lseek()/ed, and introducing per-file shared > data adds *substantial* complexity, because the amount of shared memory > needs to be pre-determined. The reason I want to put per-relation data > into shared memory is different, it's about putting the buffer mapping > into shared memory, and that, as a prerequisite, also need per-relation > data. And there's a limit of the number of relations that need to be > open (one per cached page at max), and space can be freed by evicting > pages. Ahh.. You are right about the logic of putting it in the shared memory. As for Thomas' toy patch, multiple files share one counter in shmem. Although it currently works, it might likely to miss. Though his eventual plan of the idea is to use an array of N counters and map relation OIDs onto them. But as your point about complexity says, in shared memory we cannot share the same area with multiple files, so that needs an area to allocate depending on the number of files. Regarding the allocation of per-relation data in shared memory, I thought it can be a separated component at first so I asked for validity of the idea. But now I consider the point raised. Regards, Kirk Jamison
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <k.jamison@jp.fujitsu.com> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24> > On February 6, 2019, 8:57 AM +0000, Andres Freund wrote: > > Maybe I'm missing something here, but why is it actually necessary to > > have the sizes in shared memory, if we're just talking about caching > > sizes? It's pretty darn cheap to determine the filesize of a file that > > has been recently stat()/lseek()/ed, and introducing per-file shared > > data adds *substantial* complexity, because the amount of shared memory > > needs to be pre-determined. The reason I want to put per-relation data > > into shared memory is different, it's about putting the buffer mapping > > into shared memory, and that, as a prerequisite, also need per-relation > > data. And there's a limit of the number of relations that need to be > > open (one per cached page at max), and space can be freed by evicting > > pages. > > Ahh.. You are right about the logic of putting it in the shared memory. > As for Thomas' toy patch, multiple files share one counter in shmem. > Although it currently works, it might likely to miss. > Though his eventual plan of the idea is to use an array of N counters > and map relation OIDs onto them. > But as your point about complexity says, in shared memory we cannot > share the same area with multiple files, so that needs an area to > allocate depending on the number of files. > > Regarding the allocation of per-relation data in shared memory, I > thought it can be a separated component at first so I asked for > validity of the idea. But now I consider the point raised. I still believe that one shared memory element for every non-mapped relation is not only too-complex but also too-much, as Andres (and implicitly I) wrote. I feel that just one flag for all works fine but partitioned flags (that is, relations or files corresponds to the same hash value share one flag) can reduce the shared memory elements to a fixed small number. Note: I'm still not sure how much lseek impacts performance. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, 14 Feb 2019 at 20:41, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Wed, 13 Feb 2019 05:48:28 +0000, "Jamison, Kirk" <k.jamison@jp.fujitsu.com> wrote in <D09B13F772D2274BB348A310EE3027C643880D@g01jpexmbkw24>
> On February 6, 2019, 8:57 AM +0000, Andres Freund wrote:
> > Maybe I'm missing something here, but why is it actually necessary to
> > have the sizes in shared memory, if we're just talking about caching
> > sizes? It's pretty darn cheap to determine the filesize of a file that
> > has been recently stat()/lseek()/ed, and introducing per-file shared
> > data adds *substantial* complexity, because the amount of shared memory
> > needs to be pre-determined. The reason I want to put per-relation data
> > into shared memory is different, it's about putting the buffer mapping
> > into shared memory, and that, as a prerequisite, also need per-relation
> > data. And there's a limit of the number of relations that need to be
> > open (one per cached page at max), and space can be freed by evicting
> > pages.
>
> Ahh.. You are right about the logic of putting it in the shared memory.
> As for Thomas' toy patch, multiple files share one counter in shmem.
> Although it currently works, it might likely to miss.
> Though his eventual plan of the idea is to use an array of N counters
> and map relation OIDs onto them.
> But as your point about complexity says, in shared memory we cannot
> share the same area with multiple files, so that needs an area to
> allocate depending on the number of files.
>
> Regarding the allocation of per-relation data in shared memory, I
> thought it can be a separated component at first so I asked for
> validity of the idea. But now I consider the point raised.
I still believe that one shared memory element for every
non-mapped relation is not only too-complex but also too-much, as
Andres (and implicitly I) wrote. I feel that just one flag for
all works fine but partitioned flags (that is, relations or files
corresponds to the same hash value share one flag) can reduce the
shared memory elements to a fixed small number.
Note: I'm still not sure how much lseek impacts performance.
Of course all the "flag"s above actually are "update counter"s :)
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Dec 31, 2019 at 4:43 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I still believe that one shared memory element for every
> non-mapped relation is not only too-complex but also too-much, as
> Andres (and implicitly I) wrote. I feel that just one flag for
> all works fine but partitioned flags (that is, relations or files
> corresponds to the same hash value share one flag) can reduce the
> shared memory elements to a fixed small number.

There is one potentially interesting case that doesn't require any kind of shared cache invalidation AFAICS. XLogReadBufferExtended() calls smgrnblocks() for every buffer access, even if the buffer is already in our buffer pool. I tried to make yet another quick experiment-grade patch to cache the size[1], this time for use in recovery only.

initdb -D pgdata
postgres -D pgdata -c checkpoint_timeout=60min

In another shell:
pgbench -i -s100 postgres
pgbench -M prepared -T60 postgres
killall -9 postgres
mv pgdata pgdata-save

Master branch:

cp -r pgdata-save pgdata
strace -c -f postgres -D pgdata
[... wait for "redo done", then hit ^C ...]

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 18.61   22.492286          26    849396           lseek
  6.95    8.404369          30    277134           pwrite64
  6.63    8.009679          28    277892           pread64
  0.50    0.604037          39     15169           sync_file_range
...

Patched:

rm -fr pgdata
cp -r pgdata-save pgdata
strace -c -f ~/install/bin/postgres -D pgdata
[... wait for "redo done", then hit ^C ...]

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
...
 16.33    8.097631          29    277134           pwrite64
 15.56    7.715052          27    277892           pread64
  1.13    0.559648          39     14137           sync_file_range
...
  0.00    0.001505          25        59           lseek

> Note: I'm still not sure how much lseek impacts performance.

It doesn't seem great that we are effectively making system calls for most WAL records we replay, but, sadly, in this case the patch didn't really make any measurable difference when run without strace on this Linux VM. I suspect there is some workload and stack where it would make a difference (CF the read(postmaster pipe) call for every WAL record that was removed), but this is just something I noticed in passing while working on something else, so I haven't investigated much.

[1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks (just a test, unfinished, probably has bugs)
Hi,

On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
> There is one potentially interesting case that doesn't require any
> kind of shared cache invalidation AFAICS. XLogReadBufferExtended()
> calls smgrnblocks() for every buffer access, even if the buffer is
> already in our buffer pool.

Yea, that's really quite bad*. The bit about doing so even when already in the buffer pool is particularly absurd. Needing to have special handling in mdcreate() for XLogReadBufferExtended() always calling it is also fairly ugly.

> It doesn't seem great that we are effectively making system calls for
> most WAL records we replay, but, sadly, in this case the patch didn't
> really make any measurable difference when run without strace on this
> Linux VM. I suspect there is some workload and stack where it would
> make a difference (CF the read(postmaster pipe) call for every WAL
> record that was removed), but this is just something I noticed in
> passing while working on something else, so I haven't investigated
> much.

I wonder if that's just because your workload is too significantly bottlenecked elsewhere:

> postgres -D pgdata -c checkpoint_timeout=60min
> In another shell:
> pgbench -i -s100 postgres
> pgbench -M prepared -T60 postgres
> killall -9 postgres
> mv pgdata pgdata-save

With scale 100, but the default shared_buffers, you'll frequently hit the OS for reads/writes. Which will require the same metadata in the kernel, but then also memcpys between kernel and userspace.

A word of caution about strace's -c: In my experience the total time measurements are very imprecise somehow. I think it might be that some of the overhead of ptracing will be attributed to the syscalls or such, which means frequent syscalls appear relatively more expensive than they really are.

Greetings,

Andres Freund

* it insults my sense of aesthetics
On Tue, Feb 4, 2020 at 2:23 AM Andres Freund <andres@anarazel.de> wrote: > On 2019-12-31 17:05:31 +1300, Thomas Munro wrote: > > There is one potentially interesting case that doesn't require any > > kind of shared cache invalidation AFAICS. XLogReadBufferExtended() > > calls smgrnblocks() for every buffer access, even if the buffer is > > already in our buffer pool. > > Yea, that's really quite bad*. The bit about doing so even when already > in the buffer pool is particularly absurd. Needing to have special > handling in mdcreate() for XLogReadBufferExtended() always calling it is > also fairly ugly. Yeah. It seems like there are several things to fix there. So now I'm wondering if we should start out by trying to cache the size it in the smgr layer for recovery only, like in the attached, and then later try to extend the scheme to cover the shared case as discussed at the beginning of the thread. > A word of caution about strace's -c: In my experience the total time > measurements are very imprecise somehow. I think it might be that some > of the overhead of ptracing will be attributed to the syscalls or such, > which means frequent syscalls appear relatively more expensive than they > really are. Yeah, those times are large, meaningless tracing overheads. While some systems might in fact be happy replaying a couple of million lseeks per second, (1) I'm pretty sure some systems would not be happy about that (see claims in this thread) and (2) it means you can't practically use strace on recovery because it slows it down to a crawl, which is annoying.
Attachment
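A very rough idea of what such a recovery-only cache at the smgr level could look like (the field name and exact conditions are guesses for illustration and may differ from the attached patch): because only the startup process changes relation sizes during recovery, a value remembered in the SMgrRelation can be trusted there without any shared invalidation.

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
    BlockNumber result;

    /* In recovery, trust a previously cached value if we have one. */
    if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);  /* lseek */

    /* smgrextend()/smgrtruncate() would keep this field up to date too. */
    if (InRecovery)
        reln->smgr_cached_nblocks[forknum] = result;

    return result;
}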
On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > ... (1) I'm pretty sure some systems would not be happy > about that (see claims in this thread) ... I poked a couple of people off-list and learned that, although the Linux and FreeBSD systems I tried could do a million lseek(SEEK_END) calls in 60-100ms, a couple of different Windows systems took between 1.1 and 3.4 seconds (worse times when running as non-administrator), which seems to be clearly in the territory that can put a dent in your recovery speeds on that OS. I also learned that GetFileSizeEx() is "only" about twice as fast, which is useful information for that other thread about which syscall to use for this, but it's kind of irrelevant this thread about how we can get rid of these crazy syscalls altogether.
On Fri, Feb 14, 2020 at 1:50 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Feb 13, 2020 at 7:18 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > ... (1) I'm pretty sure some systems would not be happy > > about that (see claims in this thread) ... > > I poked a couple of people off-list and learned that, although the > Linux and FreeBSD systems I tried could do a million lseek(SEEK_END) > calls in 60-100ms, a couple of different Windows systems took between > 1.1 and 3.4 seconds (worse times when running as non-administrator), > which seems to be clearly in the territory that can put a dent in your > recovery speeds on that OS. I also learned that GetFileSizeEx() is > "only" about twice as fast, which is useful information for that other > thread about which syscall to use for this, but it's kind of > irrelevant this thread about how we can get rid of these crazy > syscalls altogether. I received a report off-list from someone who experimented with the patch I shared earlier on this thread[1], using a crash recovery test similar to one I showed on the WAL prefetching thread[2] (which he was also testing, separately). He observed that the lseek() rate in recovery was actually a significant problem for his environment on unpatched master, showing up as the top sampled function in perf, and by using that patch he got (identical) crash recovery to complete in 41s instead of 65s, with a sane looking perf (= 58% speedup). His test system was an AWS i3.16xlarge running an unspecified version of Linux. I think it's possible that all of the above reports can be explained by variations in the speculative execution bug mitigations that are enabled by default on different systems, but I haven't tried to investigate that yet. [1] https://github.com/postgres/postgres/compare/master...macdice:cache-nblocks [2] https://www.postgresql.org/message-id/CA%2BhUKG%2BOcWr8nHqa3%3DZoPTGgdDcrgjSC4R2sT%2BjrUunBua3rpg%40mail.gmail.com
On Sat, Apr 11, 2020 at 4:10 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I received a report off-list from someone who experimented with the > patch I shared earlier on this thread[1], using a crash recovery test > similar to one I showed on the WAL prefetching thread[2] (which he was > also testing, separately). Rebased. I'll add this to the open commitfest.
Attachment
On Sat, Jun 20, 2020 at 10:32 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Rebased. I'll add this to the open commitfest. I traced the recovery process while running pgbench -M prepared -c16 -j16 -t10000 (= 160,000 transactions). With the patch, the number of lseeks went from 1,080,661 (6.75 per pgbench transaction) to just 85. I went ahead and pushed this patch. There's still the matter of crazy numbers of lseeks in regular backends; looking at all processes while running the above test, I get 1,469,060 (9.18 per pgbench transaction) without -M prepared, and 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that with this approach will require bullet-proof shared invalidation, but I think it's doable, in later work.
On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> There's still the matter of crazy numbers of lseeks in regular
> backends; looking at all processes while running the above test, I get
> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
> with this approach will require bullet-proof shared invalidation, but
> I think it's doable, in later work.

I couldn't help hacking on this a bit. Perhaps instead of bullet-proof general shared invalidation, we should have a way for localised bits of code to state that they are ok with a "relaxed" value. Then they should explain the theory for why that is safe in each case based on arguments about memory barrier pairs, but leave all other callers making the system call so that we don't torpedo the whole project by making it too hard. For now, the main cases I'm interested in are the ones that show up all the time as the dominant system call in various workloads:

(1) Sequential scan relation-size probe. This should be OK with a relaxed value. You can't miss the invalidation for a truncation, because the read barrier in your lock acquisition on the relation pairs with the write barrier in the exclusive lock release of any session that truncated, and you can't miss any relation extension that your snapshot can see, because the release of the extension lock pairs with the lock involved in snapshot acquisition.

(2) Planner relation-size probe, which should be OK with a relaxed value. Similar arguments give you a fresh enough view, I think.

Or maybe there is a theory along these lines that already covers every use of smgrnblocks(), so a separate mode isn't required... I don't know!

The attached sketch-quality patch shows some of the required elements working, though I ran out of steam trying to figure out how to thread this thing through the right API layers, so for now it always asks for a relaxed value in table_block_relation_size().
Attachment
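One possible API shape for that idea, with entirely made-up names (an extra mode parameter, rather than whatever the sketch patch actually does), might be:

/* Callers state explicitly whether a slightly stale answer is acceptable. */
typedef enum SmgrNblocksMode
{
    SMGR_NBLOCKS_ACCURATE,      /* always ask the kernel (lseek(SEEK_END)) */
    SMGR_NBLOCKS_RELAXED        /* a recently published shared value is OK */
} SmgrNblocksMode;

extern BlockNumber smgrnblocks_ext(SMgrRelation reln, ForkNumber forknum,
                                   SmgrNblocksMode mode);

/*
 * Case (1): sequential scans.  A relaxed value is claimed to be safe here
 * because the caller already holds a lock on the relation (see the
 * barrier-pairing argument above).  Simplified: InvalidForkNumber handling
 * from the real table_block_relation_size() is elided.
 */
uint64
table_block_relation_size(Relation rel, ForkNumber forkNumber)
{
    RelationOpenSmgr(rel);
    return (uint64) BLCKSZ *
        smgrnblocks_ext(rel->rd_smgr, forkNumber, SMGR_NBLOCKS_RELAXED);
}

/* Case (2): estimate_rel_size() in the planner could pass the same mode. */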
On 01.08.2020 00:56, Thomas Munro wrote:
> On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> There's still the matter of crazy numbers of lseeks in regular
>> backends; looking at all processes while running the above test, I get
>> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
>> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
>> with this approach will require bullet-proof shared invalidation, but
>> I think it's doable, in later work.
> I couldn't help hacking on this a bit. Perhaps instead of
> bullet-proof general shared invalidation, we should have a way for
> localised bits of code to state that they are ok with a "relaxed"
> value. Then they should explain the theory for why that is safe in
> each case based on arguments about memory barrier pairs, but leave all
> other callers making the system call so that we don't torpedo the
> whole project by making it too hard. For now, the main cases I'm
> interested in are the ones that show up all the time as the dominant
> system call in various workloads:
>
> (1) Sequential scan relation-size probe. This should be OK with a
> relaxed value. You can't miss the invalidation for a truncation,
> because the read barrier in your lock acquisition on the relation
> pairs with the write barrier in the exclusive lock release of any
> session that truncated, and you can't miss any relation
> extension that your snapshot can see, because the release of the
> extension lock pairs with the lock involved in snapshot acquisition.
>
> (2) Planner relation-size probe, which should be OK with a relaxed
> value. Similar arguments give you a fresh enough view, I think.
>
> Or maybe there is a theory along these lines that already covers every
> use of smgrnblocks(), so a separate mode isn't required... I don't
> know!
>
> The attached sketch-quality patch shows some of the required elements
> working, though I ran out of steam trying to figure out how to thread
> this thing through the right API layers so for now it always asks for
> a relaxed value in table_block_relation_size().

So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation

If solving such a very important but still specific problem of caching relation size requires so much effort, then maybe it is time to go further in the direction towards a shared catalog? This shared relation cache can easily store relation size as well. In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations O(max_connections*n_relations)
- sophisticated invalidation protocol and related performance issues

Certainly access to shared cache requires extra synchronization. But DDL operations are relatively rare. So in most cases we will have only shared locks. Maybe the overhead of locking will not be too large?
On Mon, 3 Aug 2020 at 17:54, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> On 01.08.2020 00:56, Thomas Munro wrote:
> > On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >> There's still the matter of crazy numbers of lseeks in regular
> >> backends; looking at all processes while running the above test, I get
> >> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> >> 193,722 with -M prepared (1.21 per pgbench transaction). Fixing that
> >> with this approach will require bullet-proof shared invalidation, but
> >> I think it's doable, in later work.
> > I couldn't help hacking on this a bit. Perhaps instead of
> > bullet-proof general shared invalidation, we should have a way for
> > localised bits of code to state that they are ok with a "relaxed"
> > value. Then they should explain the theory for why that is safe in
> > each case based on arguments about memory barrier pairs, but leave all
> > other callers making the system call so that we don't torpedo the
> > whole project by making it too hard. For now, the main cases I'm
> > interested in are the ones that show up all the time as the dominant
> > system call in various workloads:
> >
> > (1) Sequential scan relation-size probe. This should be OK with a
> > relaxed value. You can't miss the invalidation for a truncation,
> > because the read barrier in your lock acquisition on the relation
> > pairs with the write barrier in the exclusive lock release of any
> > session that truncated, and you can't miss any relation
> > extension that your snapshot can see, because the release of the
> > extension lock pairs with the lock involved in snapshot acquisition.
> >
> > (2) Planner relation-size probe, which should be OK with a relaxed
> > value. Similar arguments give you a fresh enough view, I think.
> >
> > Or maybe there is a theory along these lines that already covers every
> > use of smgrnblocks(), so a separate mode isn't required... I don't
> > know!
> >
> > The attached sketch-quality patch shows some of the required elements
> > working, though I ran out of steam trying to figure out how to thread
> > this thing through the right API layers so for now it always asks for
> > a relaxed value in table_block_relation_size().
> So in this thread three solutions are proposed:
> 1. "bullet-proof general shared invalidation"
> 2. recovery-only solution avoiding shared memory and invalidation
> 3. "relaxed" shared memory cache with simplified invalidation
> If solving such a very important but still specific problem of caching
> relation size requires so much effort,
> then maybe it is time to go further in the direction towards a shared
> catalog?
> This shared relation cache can easily store relation size as well.
> In addition it will solve a lot of other problems:
> - noticeable overhead of local relcache warming
> - large memory consumption in case of larger number of relations
> O(max_connections*n_relations)
> - sophisticated invalidation protocol and related performance issues
> Certainly access to shared cache requires extra synchronization. But DDL
> operations are relatively rare.
Some applications use CREATE TEMP TABLE, DROP TEMP TABLE, or CREATE TABLE AS SELECT .. very frequently.
Regards
Pavel
> So in most cases we will have only shared locks. Maybe the overhead of
> locking will not be too large?
On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > So in this thread three solutions are proposed: > 1. "bullet-proof general shared invalidation" > 2. recovery-only solution avoiding shared memory and invalidation > 3. "relaxed" shared memory cache with simplified invalidation Hi Konstantin, By the way, point 2 is now committed (c5315f4f). As for 1 vs 3, I wasn't proposing two different invalidation techniques: in both approaches, I'm calling the cached values "relaxed", meaning that their freshness is controlled by memory barriers elsewhere that the caller has to worry about. I'm just suggesting for idea 3 that it might be a good idea to use relaxed values only in a couple of hot code paths where we do the analysis required to convince ourselves that memory barriers are already in the right places to make it safe. By "bullet-proof" I meant that we could in theory convince ourselves that *all* users of smgrnblocks() can safely use relaxed values, but that's hard. That said, the sketch patch I posted certainly needs more work, and maybe someone has a better idea on how to do it. > If solving such very important by still specific problem of caching > relation size requires so much efforts, > then may be it is time to go further in the direction towards shared > catalog? I wouldn't say it requires too much effort, at least the conservative approach (3). But I also agree with what you're saying, in the long term: > This shared relation cache can easily store relation size as well. > In addition it will solve a lot of other problems: > - noticeable overhead of local relcache warming > - large memory consumption in case of larger number of relations > O(max_connections*n_relations) > - sophisticated invalidation protocol and related performance issues > Certainly access to shared cache requires extra synchronization.But DDL > operations are relatively rare. > So in most cases we will have only shared locks. May be overhead of > locking will not be too large? Yeah, I would be very happy if we get a high performance shared sys/rel/plan/... caches in the future, and separately, having the relation size available in shmem is something that has come up in discussions about other topics too (tree-based buffer mapping, multi-relation data files, ...). I agree with you that our cache memory usage is a big problem, and it will be great to fix that one day. I don't think that should stop us from making small improvements to the existing design in the meantime, though. "The perfect is the enemy of the good." Look at all this trivial stuff: https://wiki.postgresql.org/wiki/Syscall_Reduction I don't have high quality data to report yet, but from simple tests I'm seeing orders of magnitude fewer syscalls per pgbench transaction in recovery, when comparing 11 to 14devel, due to various changes. Admittedly, the size probes in regular backends aren't quite as bad as the recovery ones were, because they're already limited by the rate you can throw request/response cycles at the DB, but there are still some cases like planning for high-partition counts and slow filesystems that can benefit measurably from caching.
On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > This shared relation cache can easily store relation size as well. > > In addition it will solve a lot of other problems: > > - noticeable overhead of local relcache warming > > - large memory consumption in case of larger number of relations > > O(max_connections*n_relations) > > - sophisticated invalidation protocol and related performance issues > > Certainly access to shared cache requires extra synchronization.But DDL > > operations are relatively rare. > > So in most cases we will have only shared locks. May be overhead of > > locking will not be too large? > > Yeah, I would be very happy if we get a high performance shared > sys/rel/plan/... caches in the future, and separately, having the > relation size available in shmem is something that has come up in > discussions about other topics too (tree-based buffer mapping, > multi-relation data files, ...). ... After recent discussions about the limitations of relying on SEEK_END in a nearby thread[1], I decided to try to prototype a system for tracking relation sizes properly in shared memory. Earlier in this thread I was talking about invalidation schemes for backend-local caches, because I only cared about performance. In contrast, this new system has SMgrRelation objects that point to SMgrSharedRelation objects (better names welcome) that live in a pool in shared memory, so that all backends agree on the size. The scheme is described in the commit message and comments. The short version is that smgr.c tracks the "authoritative" size of any relation that has recently been extended or truncated, until it has been fsync'd. By authoritative, I mean that there may be dirty buffers in that range in our buffer pool, even if the filesystem has vaporised the allocation of disk blocks and shrunk the file. That is, it's not really a "cache". It's also not like a shared catalog, which Konstantin was talking about... it's more like the pool of inodes in a kernel's memory. It holds all currently dirty SRs (SMgrSharedRelations), plus as many clean ones as it can fit, with some kind of reclamation scheme, much like buffers. Here, "dirty" means the size changed. Attached is an early sketch, not debugged much yet (check undir contrib/postgres_fdw fails right now for a reason I didn't look into), and there are clearly many architectural choices one could make differently, and more things to be done... but it seemed like enough of a prototype to demonstrate the concept and fuel some discussion about this and whatever better ideas people might have... Thoughts? [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com
Attachment
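To make the shape of the proposal easier to follow, here is a rough reconstruction of the shared-memory structures involved, pieced together from the discussion in this thread; the field names and types are approximations, not the exact contents of the attached patch.

    /* Rough sketch reconstructed from the discussion; not the patch itself. */
    typedef struct SMgrSharedRelation
    {
        RelFileNodeBackend rnode;               /* which relation this slot describes */
        pg_atomic_uint32 flags;                 /* SR_LOCKED, SR_VALID, SR_DIRTY(fork), ... */
        uint64      generation;                 /* bumped when the slot is evicted or dropped */
        BlockNumber nblocks[MAX_FORKNUM + 1];   /* authoritative size of each fork */
    } SMgrSharedRelation;

    /*
     * An entry in the hash table that lets us find a pool slot by rnode;
     * index points into SMgrSharedRelationPool->objects[].
     */
    typedef struct SMgrSharedRelationMapping
    {
        RelFileNodeBackend rnode;
        int         index;
    } SMgrSharedRelationMapping;

Each backend's SMgrRelation then keeps a pointer to its slot plus the generation it saw, which is what the fast path discussed later in the thread checks.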
On 16.11.2020 10:11, Thomas Munro wrote: > On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik >> <k.knizhnik@postgrespro.ru> wrote: >>> This shared relation cache can easily store relation size as well. >>> In addition it will solve a lot of other problems: >>> - noticeable overhead of local relcache warming >>> - large memory consumption in case of larger number of relations >>> O(max_connections*n_relations) >>> - sophisticated invalidation protocol and related performance issues >>> Certainly access to shared cache requires extra synchronization.But DDL >>> operations are relatively rare. >>> So in most cases we will have only shared locks. May be overhead of >>> locking will not be too large? >> Yeah, I would be very happy if we get a high performance shared >> sys/rel/plan/... caches in the future, and separately, having the >> relation size available in shmem is something that has come up in >> discussions about other topics too (tree-based buffer mapping, >> multi-relation data files, ...). ... > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. I will look at your implementation more precisely latter. But now I just to ask one question: is it reasonable to spent substantial amount of efforts trying to solve the problem of redundant calculations of relation size, which seems to be just part of more general problem: shared relation cache? It is well known problem: private caches of system catalog can cause unacceptable memory consumption for large number of backends. If there are thousands relations in the database (which is not so rare case, especially taken in account ORMs and temporary tables) then size of catalog cache can be several megabytes. If it is multiplies by thousand backends, then we get gigabytes of memory used just for private catalog caches. Thanks to Andres optimizations of taking snapshots, now Postgres can normally handle thousands of connections. So this extremely inefficient use of memory for private catalog caches becomes more and more critical. Also private caches requires sophisticated and error prone invalidation mechanism. If we will have such shared cache, then keeping shared relation size seems to be trivial task, isn't it? 
So maybe, before "diving" into the problem of maintaining a shared pool of dirtied "inodes", we should first discuss and concentrate on solving the more generic problem?
On 16.11.2020 10:11, Thomas Munro wrote: > On Tue, Aug 4, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik >> <k.knizhnik@postgrespro.ru> wrote: >>> This shared relation cache can easily store relation size as well. >>> In addition it will solve a lot of other problems: >>> - noticeable overhead of local relcache warming >>> - large memory consumption in case of larger number of relations >>> O(max_connections*n_relations) >>> - sophisticated invalidation protocol and related performance issues >>> Certainly access to shared cache requires extra synchronization.But DDL >>> operations are relatively rare. >>> So in most cases we will have only shared locks. May be overhead of >>> locking will not be too large? >> Yeah, I would be very happy if we get a high performance shared >> sys/rel/plan/... caches in the future, and separately, having the >> relation size available in shmem is something that has come up in >> discussions about other topics too (tree-based buffer mapping, >> multi-relation data files, ...). ... > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. > > Attached is an early sketch, not debugged much yet (check undir > contrib/postgres_fdw fails right now for a reason I didn't look into), > and there are clearly many architectural choices one could make > differently, and more things to be done... but it seemed like enough > of a prototype to demonstrate the concept and fuel some discussion > about this and whatever better ideas people might have... > > Thoughts? > > [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com I noticed that there are several fragments like this: if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) smgrcreate(rel->rd_smgr, FSM_FORKNUM, false); fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); I wonder if it will be more efficient and simplify code to add "create_if_not_exists" parameter to smgrnblocks? It will avoid extra hash lookup and avoid explicit checks for fork presence in multiple places? -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
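A sketch of what that suggestion might look like at such a call site; smgrnblocks_create is a hypothetical name used only for illustration, not an interface that exists in the tree.

    /* today (as quoted above) */
    if (!smgrexists(rel->rd_smgr, FSM_FORKNUM))
        smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
    fsm_nblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM);

    /* with a hypothetical create_if_not_exists flag */
    fsm_nblocks_now = smgrnblocks_create(rel->rd_smgr, FSM_FORKNUM, true);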
On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote: > I will look at your implementation more precisely latter. Thanks! Warning: I thought about making a thing like this for a while, but the patch itself is only a one-day prototype, so I am sure you can find many bugs... Hmm, I guess the smgrnblocks_fast() code really needs a nice big comment to explain the theory about why this value is fresh enough, based on earlier ideas about implicit memory barriers. (Something like: if you're extending, you must have acquired the extension lock; if you're scanning you must have acquired a snapshot that only needs to see blocks that existed at that time and concurrent truncations are impossible; I'm wondering about special snapshots like VACUUM, and whether this is too relaxed for that, or if it's covered by existing horizon-based interlocking...). > But now I just to ask one question: is it reasonable to spent > substantial amount of efforts trying to solve > the problem of redundant calculations of relation size, which seems to > be just part of more general problem: shared relation cache? It's a good question. I don't know exactly how to make a shared relation cache (I have only some vague ideas and read some of Idehira-san's work) but I agree that it would be very nice. However, I think that's an independent and higher level thing, because Relation != SMgrRelation. What is an SMgrRelation? Not much! It holds file descriptors, which are different in every backend so you can't share them, and then some information about sizes and a target block for insertions. With this patch, the sizes become shared and more reliable, so there's *almost* nothing left at all except for the file descriptors and the pointer to the shared object. I haven't looked into smgr_targblock, the last remaining non-shared thing, but there might be an opportunity to do something clever for parallel inserters (the proposed feature) by coordinating target blocks in SMgrSharedRelation; I don't know. > It is well known problem: private caches of system catalog can cause > unacceptable memory consumption for large number of backends. Yeah. > Also private caches requires sophisticated and error prone invalidation > mechanism. > > If we will have such shared cache, then keeping shared relation size > seems to be trivial task, isn't it? > So may be we before "diving" into the problem of maintaining shared pool > of dirtied "inodes" we can better discuss and conerntrate on solving > more generic problem? Sure, we can talk about how to make a shared relation cache (and syscache etc). I think it'll be much harder than this shared SMGR concept. Even if we figure out how to share palloc'd trees of nodes with correct invalidation and interlocking (ie like Ideriha-san's work in [1]), including good performance and space management/replacement, I guess we'll still need per-backend Relation objects to point to the per-backend SMgrRelation objects to hold the per-backend file descriptors. (Until we have threads...). But... do you think sharing SMGR relations to the maximum extent possible conflicts with that? Are you thinking there should be some general component here, perhaps a generic system for pools of shmem objects with mapping tables and a replacement policy? [1] https://www.postgresql.org/message-id/flat/4E72940DA2BF16479384A86D54D0988A567B9245%40G01JPEXMBKW04
From: Thomas Munro <thomas.munro@gmail.com> > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > I will look at your implementation more precisely latter. > > Thanks! Warning: I thought about making a thing like this for a > while, but the patch itself is only a one-day prototype, so I am sure > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > really needs a nice big comment to explain the theory about why this > value is fresh enough, based on earlier ideas about implicit memory > barriers. (Something like: if you're extending, you must have > acquired the extension lock; if you're scanning you must have acquired > a snapshot that only needs to see blocks that existed at that time and > concurrent truncations are impossible; I'm wondering about special > snapshots like VACUUM, and whether this is too relaxed for that, or if > it's covered by existing horizon-based interlocking...). We'd like to join the discussion and review, too, so that Kirk's optimization for VACUUM truncation and TRUNCATE can extend beyond recovery to operations during normal operation. In that regard, would you mind becoming a committer for his patch? We think it's committable now. I'm fine with using the unmodified smgrnblocks() as your devil's suggestion. > It's a good question. I don't know exactly how to make a shared > relation cache (I have only some vague ideas and read some of > Idehira-san's work) but I agree that it would be very nice. However, > I think that's an independent and higher level thing, because Relation > != SMgrRelation. > Sure, we can talk about how to make a shared relation cache (and > syscache etc). I think it'll be much harder than this shared SMGR > concept. Even if we figure out how to share palloc'd trees of nodes > with correct invalidation and interlocking (ie like Ideriha-san's work > in [1]), including good performance and space management/replacement, > I guess we'll still need per-backend Relation objects to point to the > per-backend SMgrRelation objects to hold the per-backend file > descriptors. (Until we have threads...). But... do you think sharing > SMGR relations to the maximum extent possible conflicts with that? > Are you thinking there should be some general component here, perhaps > a generic system for pools of shmem objects with mapping tables and a > replacement policy? Yes, Ideriha-san and we have completed the feature called global metacache for placing relation and catalog caches in shared memory and limiting their sizes, and incorporated it in our product to meet our customer's request. I hope we'll submit the patch in the near future when our resources permit. Regards Takayuki Tsunakawa
On Tuesday, November 17, 2020 9:40 AM, Tsunkawa-san wrote: > From: Thomas Munro <thomas.munro@gmail.com> > > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > > <k.knizhnik@postgrespro.ru> wrote: > > > I will look at your implementation more precisely latter. > > > > Thanks! Warning: I thought about making a thing like this for a > > while, but the patch itself is only a one-day prototype, so I am sure > > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > > really needs a nice big comment to explain the theory about why this > > value is fresh enough, based on earlier ideas about implicit memory > > barriers. (Something like: if you're extending, you must have > > acquired the extension lock; if you're scanning you must have acquired > > a snapshot that only needs to see blocks that existed at that time and > > concurrent truncations are impossible; I'm wondering about special > > snapshots like VACUUM, and whether this is too relaxed for that, or if > > it's covered by existing horizon-based interlocking...). > > We'd like to join the discussion and review, too, so that Kirk's optimization for > VACUUM truncation and TRUNCATRE can extend beyond recovery to > operations during normal operation. In that regard, would you mind > becoming a committer for his patch? We think it's committable now. I'm > fine with using the unmodified smgrnblocks() as your devil's suggestion. Just saw this thread has been bumped. And yes, regarding the vacuum and truncate optimization during recovery, it would still work even without the unmodified smgrnblocks. We'd just need to remove the conditions that checks the flag parameter. And if we explore your latest prototype patch, it can possibly use the new smgrnblocks_xxxx APIs introduced here. That said, we'd like to explore the next step of extending the feature to normal operations. Regards, Kirk Jamison
At Mon, 16 Nov 2020 20:11:52 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in > After recent discussions about the limitations of relying on SEEK_END > in a nearby thread[1], I decided to try to prototype a system for > tracking relation sizes properly in shared memory. Earlier in this > thread I was talking about invalidation schemes for backend-local > caches, because I only cared about performance. In contrast, this new > system has SMgrRelation objects that point to SMgrSharedRelation > objects (better names welcome) that live in a pool in shared memory, > so that all backends agree on the size. The scheme is described in > the commit message and comments. The short version is that smgr.c > tracks the "authoritative" size of any relation that has recently been > extended or truncated, until it has been fsync'd. By authoritative, I > mean that there may be dirty buffers in that range in our buffer pool, > even if the filesystem has vaporised the allocation of disk blocks and > shrunk the file. > > That is, it's not really a "cache". It's also not like a shared > catalog, which Konstantin was talking about... it's more like the pool > of inodes in a kernel's memory. It holds all currently dirty SRs > (SMgrSharedRelations), plus as many clean ones as it can fit, with > some kind of reclamation scheme, much like buffers. Here, "dirty" > means the size changed. > > Attached is an early sketch, not debugged much yet (check undir > contrib/postgres_fdw fails right now for a reason I didn't look into), > and there are clearly many architectural choices one could make > differently, and more things to be done... but it seemed like enough > of a prototype to demonstrate the concept and fuel some discussion > about this and whatever better ideas people might have... > > Thoughts? > > [1] https://www.postgresql.org/message-id/flat/OSBPR01MB3207DCA7EC725FDD661B3EDAEF660%40OSBPR01MB3207.jpnprd01.prod.outlook.com I was naively thinking that we could just remember the size of all database files in a shared array but I realized that that needs a hash table to translate rnodes into array indexes, which could grow huge.. The proposed way tries to make sure that the next fseek call tells the truth before forgetting cached values. On the other hand a shared-smgrrel allows to defer fsyncing of a (local) smgrrel until someone evicts the entry. That seems to me to be the minimal mechanism that allows us to keep us being able to know the right file size at all times, getting rid of a need to remember the size of all database files in shared memory. I'm afraid that it might cause fsync storms on very-huge systems, though.. However, if we try to do the similar thing in any other way, it seems to be that it's going grown to be more like the pg_smgr catalog. But that seems going to eat more memory and cause invalidation storms. Sorry for the rambling thought, but I think this is basically in the right direction. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Nov 17, 2020 at 4:13 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Mon, Nov 16, 2020 at 11:01 PM Konstantin Knizhnik > <k.knizhnik@postgrespro.ru> wrote: > > I will look at your implementation more precisely latter. > > Thanks! Warning: I thought about making a thing like this for a > while, but the patch itself is only a one-day prototype, so I am sure > you can find many bugs... Hmm, I guess the smgrnblocks_fast() code > really needs a nice big comment to explain the theory about why this > value is fresh enough, based on earlier ideas about implicit memory > barriers. (Something like: if you're extending, you must have > acquired the extension lock; if you're scanning you must have acquired > a snapshot that only needs to see blocks that existed at that time and > concurrent truncations are impossible; I'm wondering about special > snapshots like VACUUM, and whether this is too relaxed for that, or if > it's covered by existing horizon-based interlocking...). > Yeah, it is good to verify VACUUM stuff but I have another question here. What about queries having functions that access the same relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function access the relation t1)? Now, here I think because the relation 't1' is already opened, it might use the same value of blocks from the shared cache even though the snapshot for relation 't1' when accessed from func() might be different. Am, I missing something, or is it dealt in some way? -- With Regards, Amit Kapila.
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, it is good to verify VACUUM stuff but I have another question > here. What about queries having functions that access the same > relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function > access the relation t1)? Now, here I think because the relation 't1' > is already opened, it might use the same value of blocks from the > shared cache even though the snapshot for relation 't1' when accessed > from func() might be different. Am, I missing something, or is it > dealt in some way? I think it should be covered by the theory about implicit memory barriers snapshots, but to simplify things I have now removed the lock-free stuff from the main patch (0001), because it was a case of premature optimisation and it distracted from the main concept. The main patch has 128-way partitioned LWLocks for the mapping table, and then per-relfilenode spinlocks for modifying the size. There are still concurrency considerations, which I think are probably handled with the dirty-update-wins algorithm you see in the patch. In short: due to extension and exclusive locks, size changes AKA dirty updates are serialised, but clean updates can run concurrently, so we just have to make sure that clean updates never clobber dirty updates -- do you think that is on the right path?
Attachment
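To illustrate the "clean updates never clobber dirty updates" rule described above, here is a deliberately simplified sketch; smgr_record_size is an invented name, and the real patch spreads this logic across several call sites.

    /*
     * Illustrative sketch only.  A "dirty" update comes from extend/truncate
     * and is authoritative; a "clean" update merely caches what
     * lseek(SEEK_END) reported and must never overwrite a dirty value.
     */
    static void
    smgr_record_size(SMgrSharedRelation *sr, ForkNumber forknum,
                     BlockNumber nblocks, bool dirty)
    {
        uint32      flags = smgr_lock_sr(sr);   /* tiny spinlock on the flags word */

        if (dirty)
        {
            sr->nblocks[forknum] = nblocks;
            flags |= SR_DIRTY(forknum) | SR_JUST_DIRTIED(forknum);
        }
        else if (!(flags & SR_DIRTY(forknum)))
            sr->nblocks[forknum] = nblocks;     /* clean update: only if not dirty */

        smgr_unlock_sr(sr, flags);
    }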
On Thu, Nov 19, 2020 at 1:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am, I missing something, or is it
> dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
Thank you for working on it.
I spent one day studying the patch and I want to talk about one question for now.
What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
--
Best Regards
Andy Fan
Hi Andy, On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > I spent one day studying the patch and I want to talk about one question for now. > What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen > if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)? The underlying theory of the patch is that after fsync() succeeds, the new size is permanent, but before that it could change at any time because of asynchronous activities in the kernel*. Therefore, if you want to evict the size from shared memory, you have to fsync() the file first. I used smgrimmedsync() to do that. *I suspect that it can really only shrink with network file systems such as NFS and GlusterFS, where the kernel has to take liberties to avoid network round trips for every file extension, but I don't know the details on all 9 of our supported OSes and standards aren't very helpful for understanding buffered I/O.
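A simplified sketch of the eviction flow being described, distilled from the prototype; the mapping-table locking and the retry bookkeeping are omitted.

    /*
     * Sketch only: 'sr' is the slot being evicted, 'flags' was returned by
     * smgr_lock_sr(sr), and 'rnode' is a local copy of sr->rnode taken earlier.
     */
    flags |= SR_SYNCING(forknum);
    flags &= ~SR_JUST_DIRTIED(forknum);
    smgr_unlock_sr(sr, flags);

    /* Make the current size durable before we allow it to be forgotten. */
    smgrimmedsync(smgropen(rnode.node, rnode.backend), forknum);

    flags = smgr_lock_sr(sr);
    if (flags & SR_JUST_DIRTIED(forknum))
        goto retry;             /* dirtied again while syncing: give up on this slot */
    flags &= ~(SR_SYNCING(forknum) | SR_DIRTY(forknum));    /* this fork is now clean */
    smgr_unlock_sr(sr, flags);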
Hi Thomas,
Thank you for your quick response.
On Thu, Dec 17, 2020 at 3:05 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Hi Andy,
On Thu, Dec 17, 2020 at 7:29 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> I spent one day studying the patch and I want to talk about one question for now.
> What is the purpose of calling smgrimmedsync to evict a DIRTY sr (what will happen
> if we remove it and the SR_SYNCING and SR_JUST_DIRTIED flags)?
The underlying theory of the patch is that after fsync() succeeds, the
new size is permanent, but before that it could change at any time
because of asynchronous activities in the kernel*. Therefore, if you
want to evict the size from shared memory, you have to fsync() the
file first. I used smgrimmedsync() to do that.
*I suspect that it can really only shrink with network file systems
such as NFS and GlusterFS, where the kernel has to take liberties to
avoid network round trips for every file extension, but I don't know
the details on all 9 of our supported OSes and standards aren't very
helpful for understanding buffered I/O.
Let me try to understand your point. Suppose process 1 extends a file to
2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
return 1 based on the comments in the ReadBuffer_common ("because
of buggy Linux kernels that sometimes return an seek SEEK_END result
that doesn't account for a recent write"). b). When this patch is introduced,
we would get 2 blocks for sure if we can get the size from cache. c). After
user reads the 2 blocks from cache and then the cache is evicted, and user
reads the blocks again, it is possible to get 1 again.
Do we need to consider case a) in this patch? And is case c) the situation you
want to avoid, and do we have to introduce fsync to achieve this? Basically I
think case a) should not happen in practice, so we don't need to handle case
c), and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flags.
I'm not opposed to adding them, I am just interested in the background in case
you are also willing to discuss it.
I would suggest the change below for smgrnblocks_fast, FYI
@@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
/*
* The generation doesn't match, the shared relation must have been
* evicted since we got a pointer to it. We'll need to do more work.
+ * and invalid the fast path for next time.
*/
+ reln->smgr_shared = NULL;
}
--
Best Regards
Andy Fan
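For readers without the patch handy, the fast path under discussion looks roughly like the following, with Andy's suggestion applied; the exact field names (for example smgr_shared_generation) and the fallback chain are approximations of the posted patch, not its actual code.

    static BlockNumber
    smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
    {
        SMgrSharedRelation *sr = reln->smgr_shared;

        if (sr != NULL)
        {
            pg_read_barrier();

            /* Is the slot still the one we originally attached to? */
            if (reln->smgr_shared_generation == sr->generation)
                return sr->nblocks[forknum];

            /*
             * The generation doesn't match; the shared relation must have
             * been evicted since we got a pointer to it.  Invalidate the
             * fast path for next time and fall through to the slow path.
             */
            reln->smgr_shared = NULL;
        }
        /* Slow path: look up/attach the shared slot, or fall back to lseek. */
        return smgrnblocks_shared(reln, forknum);
    }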
Hi Thomas:
I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool will not be removed, because 'smgr_drop_sr' is never called. Function 'dropdb' in dbcommands.c removes the buffers from the buffer pool and unlinks the real files with 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared cache will not be dropped although the table has been removed. This will cause errors in smgr_alloc_sr -> smgropen, smgrimmedsync. The table file has been removed, so smgropen and smgrimmedsync will get an unexpected result.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 22 19:57:35 2020
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am, I missing something, or is it
> dealt in some way?
I think it should be covered by the theory about implicit memory
barriers snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept. The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size. There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch. In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > Let me try to understand your point. Suppose process 1 extends a file to > 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still > return 1 based on the comments in the ReadBuffer_common ("because > of buggy Linux kernels that sometimes return an seek SEEK_END result > that doesn't account for a recent write"). b). When this patch is introduced, > we would get 2 blocks for sure if we can get the size from cache. c). After > user reads the 2 blocks from cache and then the cache is evicted, and user > reads the blocks again, it is possible to get 1 again. > > Do we need to consider case a) in this patch? and Is case c). the situation you > want to avoid and do we have to introduce fsync to archive this? Basically I > think case a) should not happen by practice so we don't need to handle case > c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag. > I'm not opposed to adding them, I am just interested in the background in case > you are also willing to discuss it. Sorry for the lack of background -- there was some discussion on the thread "Optimize dropping of relation buffers using dlist", but it's long and mixed up with other topics. I'll try to summarise here. It's easy to reproduce files shrinking asynchronously on network filesystems that reserve space lazily[1], and we know that there have historically been problems even with local filesystems[2], leading to that comment about buggy kernels. I am only interested in the behaviour I can demonstrate, because I believe Linux is working as intended when it does that (hypothetical/unknown bugs would probably be helped by this approach too, but I'm not interested in speculating about that beyond these parentheses). So, why should we consider this? It's true that commit ffae5cc5a60's change to ReadBuffer_common() already prevents us from eating data by overwriting an existing buffer due to this phenomenon, but there are other problems: 1. A system that in this state can give an incorrect answer to a query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of a scan, so it might not see recently added blocks containing committed data. Hopefully we'll panic when we try to create a checkpoint and see errors, but we could be giving bad results for a while. 2. Commitfest entry #2319 needs an accurate size, or it might leave stray blocks in the buffer pool that will cause failures or corruption later. It's true that we could decide not to worry about this, and perhaps even acknowledge in some official way that PostgreSQL doesn't work reliably on some kinds of filesystems. But in this prototype I thought it was fun to try to fix our very high rate of lseek() syscalls and this reliability problem, at the same time. I also speculate that we'll eventually have other reasons to want a demand-paged per-relation shared memory object. > I would suggest the below change for smgrnblock_fast, FYI > > @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum) > /* > * The generation doesn't match, the shared relation must have been > * evicted since we got a pointer to it. We'll need to do more work. > + * and invalid the fast path for next time. > */ > + reln->smgr_shared = NULL; > } Thanks, makes sense -- I'll add this to the next version. 
[1] https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com [2] https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote: > I studied your patch these days and found there might be a problem. > When execute 'drop database', the smgr shared pool will not be removed because of no call 'smgr_drop_sr'. Function 'dropdb'in dbcommands.c remove the buffer from bufferpool and unlink the real files by 'rmtree', It doesn't call smgrdounlinkall,so the smgr shared cache will not be dropped although the table has been removed. This will cause some errorswhen smgr_alloc_str -> smgropen、smgrimmedsync. Table file has been removed, so smgropen and smgrimmedsync will geta unexpected result. Hi Buzhen, Thanks, you're right -- it needs to scan the pool of SRs and forget everything from the database you're dropping.
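A heavily simplified sketch of that idea follows; the real change also has to take the mapping partition locks and wait out any in-progress sync (see the later messages), and the function name and signature here are only assumptions.

    /* Sketch: forget every pool slot belonging to a dropped database. */
    void
    smgrdropdb(Oid dbid)
    {
        for (int i = 0; i < smgr_shared_relations; i++)
        {
            SMgrSharedRelation *sr = &sr_pool->objects[i];
            uint32      flags = smgr_lock_sr(sr);

            if ((flags & SR_VALID) && sr->rnode.node.dbNode == dbid)
            {
                sr->generation++;   /* invalidate any cached fast-path pointers */
                smgr_unlock_sr(sr, flags & ~SR_VALID);
                /* ... and remove the rnode -> index entry from sr_mapping_table */
            }
            else
                smgr_unlock_sr(sr, flags);
        }
    }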
On Thu, Dec 24, 2020 at 6:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Let me try to understand your point. Suppose process 1 extends a file to
> 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
> return 1 based on the comments in the ReadBuffer_common ("because
> of buggy Linux kernels that sometimes return an seek SEEK_END result
> that doesn't account for a recent write"). b). When this patch is introduced,
> we would get 2 blocks for sure if we can get the size from cache. c). After
> user reads the 2 blocks from cache and then the cache is evicted, and user
> reads the blocks again, it is possible to get 1 again.
>
> Do we need to consider case a) in this patch? and Is case c). the situation you
> want to avoid and do we have to introduce fsync to archive this? Basically I
> think case a) should not happen by practice so we don't need to handle case
> c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
> I'm not opposed to adding them, I am just interested in the background in case
> you are also willing to discuss it.
Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics. I'll try to summarise here.
It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1],
Thanks for the explaintain. After I read that thread, I am more confused now.
In [1], we have the below test result.
$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main()
{
	int fd;
	char buffer[8192] = {0};

	fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("write(...) = %zd\n", write(fd, buffer, sizeof(buffer)));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("fsync(...) = %d\n", fsync(fd));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656
I get two pieces of information from the above: a) before the fsync, lseek(fd, 0, SEEK_END)
returns a correct value, but after the fsync it returns a wrong value; b)
the fsync failed (return value is -1) in this case. I'm more confused
because of point a). Since the fsync can't correct the wrong value, what
is the point of calling fsync in this patch? (I agree that it is good to fix this
reliability problem within this patch, but I'm doubtful that fsync can help in
this case.) Am I missing something obvious?
I read your patch in detail some days ago and feel it is in pretty good shape,
and I think you understand the existing issues very well, e.g. a) smgr_alloc_sr uses a
FIFO design; b) SR_PARTITION_LOCK doesn't really use a partition lock; c)
you seem to have some concerns about the concurrency issues, though I didn't find
anything wrong; d) how to handle a DIRTY sr during eviction. I planned to recheck
item c) before this reply, but time didn't permit. May I know what
your plan is for this patch?
Best Regards
Andy Fan
Hi Thomas
I found some other problems; I want to share my changes with you so you can confirm them.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock.
LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
flags = smgr_lock_sr(sr);
Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
if (flags & SR_JUST_DIRTIED(forknum))
{
/*
* Someone else dirtied it while we were syncing, so we can't mark
* it clean. Let's give up on this SR and go around again.
*/
smgr_unlock_sr(sr, flags);
LWLockRelease(mapping_lock);
goto retry;
}
/* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
flags &= ~SR_DIRTY(forknum);
}
In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is no longer SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry with another sr; although it has been synced by smgrimmedsync, the SR_SYNCING flag isn't cleared. This might cause a deadlock. So I changed your patch as above.
<2> I changed the code in smgr_drop_sr to avoid some corner cases
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ sr->nblocks[forknum] = InvalidBlockNumber;
hash_search_with_hash_value(sr_mapping_table,
&reln->smgr_rnode,
hash,
HASH_REMOVE,
NULL);
smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the entry from sr_pool. Since smgrnblocks_fast gets the sr straight from sr_pool, I added the code above to avoid corner cases where smgrnblocks_fast returns an unexpected result. Is it necessary? I'd also like some advice from you.
Thanks a lot.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Tue Dec 29 22:52:59 2020
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Re: Cache relation sizes?
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
> I studied your patch these days and found there might be a problem.
> When execute 'drop database', the smgr shared pool will not be removed because of no call 'smgr_drop_sr'. Function 'dropdb' in dbcommands.c remove the buffer from bufferpool and unlink the real files by 'rmtree', It doesn't call smgrdounlinkall, so the smgr shared cache will not be dropped although the table has been removed. This will cause some errors when smgr_alloc_str -> smgropen、smgrimmedsync. Table file has been removed, so smgropen and smgrimmedsync will get a unexpected result.
Hi Buzhen,
Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.
Hi, On 2020-11-19 18:01:14 +1300, Thomas Munro wrote: > From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Fri, 13 Nov 2020 14:38:41 +1300 > Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory. > > Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC > smgr_shared_relation controls how many can exist at once, and they are > evicted as required. As noted over IM, I think it'd be good if we could avoid the need for a new GUC here and instead derive it from other settings. E.g. max_files_per_process * MaxBackends or maybe max_files_per_process * log(MaxBackends) (on the basis that it's unlikely that each backend uses completely separate files). Given the size of the structs it seems that the memory overhead of this should be acceptable? > XXX smgrimmedsync() is maybe too blunt an instrument? > > XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via > some new interface, but it doesn't actually know if it's cleaning a fork > that extended a relation... > > XXX perhaps bgwriter should try to clean them? > As I suggested on the same call, I think it might make sense to integrate this with the checkpointer sync queue. Have checkpointer try to keep some a certain fraction of the entries clean (or even make them unused?), by processing all the sync requests for that relfilnode. That'd allow to a) only sync segments we actually need to sync b) syncing those segments again at the end of the checkpoint. Another use-case I have is that I'd like to make file-extension better than it currently is. There's two main challenges: - Having bulk extension add pages to the FSM isn't actually a good solution, as that has a high likelihood of multiple backends ending up trying to insert into the same page. If we had, in SMgrSharedRelation, a 'physical' relation size and a 'logical' relation size, we could do better, and hand out pre-allocated pages to exactly one backend. - Doing bulk-extension when there's no other space left leads to large pile-ups of processes needing extension lock -> a large latency increase. Instead we should opportunistically start to do extension when we've used-up most of the bulk allocated space. > +/* > + * An entry in the hash table that allows us to look up objects in the > + * SMgrSharedRelation pool by rnode (+ backend). > + */ > +typedef struct SMgrSharedRelationMapping > +{ > + RelFileNodeBackend rnode; > + int index; > +} SMgrSharedRelationMapping; Maybe add a note that index points into SMgrSharedRelationPool->objects[]? And why you chose to not store SMgrSharedRelation directly in the hashtable (makes sense to me, still worth commenting on imo). > +/* Flags. */ > +#define SR_LOCKED 0x01 > +#define SR_VALID 0x02 > + > +/* Each forknum gets its own dirty, syncing and just dirtied bits. */ > +#define SR_DIRTY(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0)) > +#define SR_SYNCING(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1)) > +#define SR_JUST_DIRTIED(forknum) (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2)) Hm - do we really want all that stored in a single flag value? > +/* > + * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for > + * only a few instructions. The return value is the current set of flags, > + * which may be modified and then passed to smgr_unlock_sr() to be atomically > + * when the lock is released. 
> + */ > +static uint32 > +smgr_lock_sr(SMgrSharedRelation *sr) > +{ > + > + for (;;) > + { > + uint32 old_flags = pg_atomic_read_u32(&sr->flags); > + uint32 flags; > + > + if (!(old_flags & SR_LOCKED)) > + { > + flags = old_flags | SR_LOCKED; > + if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags)) > + return flags; > + } > + } > + return 0; /* unreachable */ > +} Since this is a spinlock, you should use the spin delay infrastructure (c.f. LWLockWaitListLock()). Otherwise it'll react terribly if there ever ended up contention. What's the reason not to use a per-entry lwlock instead? Naming: I find it weird to have smgr_ as a prefix and _sr as a postfix. I'd go for smgr_sr_$op. > +/* > + * Unlock a SMgrSharedRelation, atomically updating its flags at the same > + * time. > + */ > +static void > +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags) > +{ > + pg_write_barrier(); > + pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED); > +} This assumes there never are atomic modification of flags without holding the spinlock (or a cmpxchg loop testing whether somebody is holding the lock).Is that right / good? > +/* > + * Allocate a new invalid SMgrSharedRelation, and return it locked. > + * > + * The replacement algorithm is a simple FIFO design with no second chance for > + * now. > + */ > +static SMgrSharedRelation * > +smgr_alloc_sr(void) > +{ > + SMgrSharedRelationMapping *mapping; > + SMgrSharedRelation *sr; > + uint32 index; > + LWLock *mapping_lock; > + uint32 flags; > + RelFileNodeBackend rnode; > + uint32 hash; > + > + retry: > + /* Lock the next one in clock-hand order. */ > + index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations; > + sr = &sr_pool->objects[index]; > + flags = smgr_lock_sr(sr); Modulo is pretty expensive - I'd force smgr_shared_relations to be a power-of-two and then do a & smgr_shared_relations_mask initialized to (smgr_shared_relations - 1). > + /* If it's unused, can return it, still locked, immediately. */ > + if (!(flags & SR_VALID)) > + return sr; > + > + /* > + * Copy the rnode and unlock. We'll briefly acquire both mapping and SR > + * locks, but we need to do it in that order, so we'll unlock the SR > + * first. > + */ > + rnode = sr->rnode; > + smgr_unlock_sr(sr, flags); > + > + hash = get_hash_value(sr_mapping_table, &rnode); > + mapping_lock = SR_PARTITION_LOCK(hash); > + > + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); > + mapping = hash_search_with_hash_value(sr_mapping_table, > + &rnode, > + hash, > + HASH_FIND, > + NULL); > + if (!mapping || mapping->index != index) > + { > + /* Too slow, it's gone or now points somewhere else. Go around. */ > + LWLockRelease(mapping_lock); > + goto retry; > + } > + > + /* We will lock the SR for just a few instructions. */ > + flags = smgr_lock_sr(sr); > + Assert(flags & SR_VALID); So the replacement here is purely based on when the relation was accessed first, not when it was accessed last. Probably good enough, but seems worth a comment. > 3. The nblocks value must be free enough for scans, extension, > truncatation and dropping buffers, because the those operations all > executed memory barriers when it acquired a snapshot to scan (which > doesn't need to see blocks added after that) or an exclusive heavyweight > lock to extend, truncate or drop. XXX right? "free enough"? I am not sure this holds for non-mvcc scans? 
> @@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln) > bool > smgrexists(SMgrRelation reln, ForkNumber forknum) > { > - if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber) > + if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber) > return true; ISTM that all these places referencing _fast is leaking an implementation detail (but not very far) - seems like that should be an implementation detail for smgrnblocks() which can then call out to _fast/_shared or such. Greetings, Andres Freund
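Two of those review points can be sketched concretely; this assumes smgr_shared_relations is rounded up to a power of two at startup and a smgr_shared_relations_mask variable is initialized to smgr_shared_relations - 1 (neither exists in the posted patch).

    /* Acquire the per-SR flag lock using the spin delay infrastructure. */
    static uint32
    smgr_lock_sr(SMgrSharedRelation *sr)
    {
        SpinDelayStatus delay;
        uint32      flags = 0;

        init_local_spin_delay(&delay);
        for (;;)
        {
            uint32      old_flags = pg_atomic_read_u32(&sr->flags);

            if (!(old_flags & SR_LOCKED))
            {
                flags = old_flags | SR_LOCKED;
                if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
                    break;
            }
            perform_spin_delay(&delay);
        }
        finish_spin_delay(&delay);
        return flags;
    }

    /* ... and inside smgr_alloc_sr: victim selection with a mask instead of '%'. */
    index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) & smgr_shared_relations_mask;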
On Mon, Dec 28, 2020 at 5:24 PM Andy Fan <zhihui.fan1213@gmail.com> wrote: > lseek(..., SEEK_END) = 9670656 > write(...) = 8192 > lseek(..., SEEK_END) = 9678848 > fsync(...) = -1 > lseek(..., SEEK_END) = 9670656 > > I got 2 information from above. a) before the fsync, the lseek(fd, 0, SEEK_END) > returns a correct value, however after the fsync, it returns a wrong value. b). > the fsync failed(return values is -1) in the above case. I'm more confused > because of point a). Since the fsync can't correct some wrong value, what > is the point to call fsync in this patch (I agree that it is good to fix this > reliability problem within this patch, but I'm doubtful if fsync can help in > this case). Am I missing something obviously? The point of the fsync() call isn't to correct the value, it's to find out if it is safe to drop the SR object from shared memory because the operating system has promised that it won't forget about the new size. If that fails, we'll report an ERROR or PANIC (depending on data_sync_retry), but not drop the SR. By the way, you don't need fsync(fd) for the size to change, as shown by the other experiment in that other thread that just did sleep(60). It might also happen asynchronously. fsync(fd) forces it, but also tells you about errors. > I read your patch with details some days ago and feel it is in pretty good shape. > and I think you are clear about the existing issues very well. like a). smgr_alloc_sr use a > FIFO design. b). SR_PARTITION_LOCK doesn't use a partition lock really. c). and Yeah, it probably should have its own bank of locks (I wish they were easier to add, via lwlocknames.txt or similar). > looks like you have some concern about the concurrency issue, which I didn't find > anything wrong. d). how to handle the DIRTY sr during evict. I planned to recheck So yeah, mostly this discussion has been about not trusting lseek() and using our own tracking of relation size instead. But there is a separate question about how "fresh" the value needs to be. The question is: under what circumstances could the unlocked read of nblocks from shared memory give you a value that isn't fresh enough for your snapshot/scan? This involves thinking about implicit memory barriers. The idea of RECENTLY_DIRTIED is similar to what we do for buffers: while you're trying to "clean" it (in this case by calling fsync()) someone else can extend the relation again, and in that case we don't know whether the new extension was included in the fsync() or not, so we can't clear the DIRTY flag when it completes. > item c) before this reply, but looks like the time is not permitted. May I know what > Is your plan about this patch? Aside from details, bugs etc discussed in this thread, one major piece remains incomplete: something in the background to "clean" SRs so that foreground processes rarely have to block.
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote: > I found some other problems which I want to share my change with you to make you confirm. > <1> I changed the code in smgr_alloc_sr to avoid dead lock. > > LWLockAcquire(mapping_lock, LW_EXCLUSIVE); > flags = smgr_lock_sr(sr); > Assert(flags & SR_SYNCING(forknum)); > + flags &= ~SR_SYNCING(forknum); > if (flags & SR_JUST_DIRTIED(forknum)) > { > /* > * Someone else dirtied it while we were syncing, so we can't mark > * it clean. Let's give up on this SR and go around again. > */ > smgr_unlock_sr(sr, flags); > LWLockRelease(mapping_lock); > goto retry; > } > /* This fork is clean! */ > - flags &= ~SR_SYNCING(forknum); > flags &= ~SR_DIRTY(forknum); > } > > In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sris SR_JUST_DIRTIED, it will retry to get another sr, although it has been synced by smgrimmedsync, the flag SR_SYNCING doesn't changed. This might cause dead lock. So I changed your patch as above. Right. Thanks! I also added smgrdropdb() to handle DROP DATABASE (the problem you reported in your previous email). While tweaking that code, I fixed it so that it uses a condition variable to wait (instead of the silly sleep loop) when it needs to drop an SR that is being sync'd. Also, it now releases the mapping lock while it's doing that, and requires it on retry. > <2> I changed the code in smgr_drop_sr to avoid some corner cases > /* Mark it invalid and drop the mapping. */ > smgr_unlock_sr(sr, ~SR_VALID); > + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) > + sr->nblocks[forknum] = InvalidBlockNumber; > hash_search_with_hash_value(sr_mapping_table, > &reln->smgr_rnode, > hash, > HASH_REMOVE, > NULL); > > smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the sr_pool. But in smgrnblocks_fast,it just get sr from sr_pool, so I add some codes as above to avoid some corner cases get an unexpected resultfrom smgrnblocks_fast. Is it necessary, I also want some advice from you. Hmm. I think it might be better to increment sr->generation. That was already done in the "eviction" code path, where other processes might still have references to the SR object, and I don't think it's possible for anyone to access a dropped relation, but I suppose it's more consistent to do that anyway. Fixed. Thanks for the review!
Attachment
On Wed, Dec 30, 2020 at 5:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > and requires it on retry s/requires/reacquires/
Hi, Thomas
Are there other scenarios that need the same treatment as dropdb for the smgr cache?
1. Drop tablespace on the master: should a smgrdroptbs function be added to DropTableSpace?
2. Drop database on a standby: smgrdropdb in dbase_redo.
3. Drop tablespace on a standby: smgrdroptbs in tblspc_redo.
Buzhen
------------------ Original Message ------------------
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Fri Jan 8 00:56:17 2021
To: 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com>
Cc: Amit Kapila <amit.kapila16@gmail.com>, Konstantin Knizhnik <k.knizhnik@postgrespro.ru>, PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真) <buzhen.cjx@alibaba-inc.com> wrote:
> I found some other problems which I want to share my change with you to make you confirm.
> <1> I changed the code in smgr_alloc_sr to avoid dead lock.
>
> LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
> flags = smgr_lock_sr(sr);
> Assert(flags & SR_SYNCING(forknum));
> + flags &= ~SR_SYNCING(forknum);
> if (flags & SR_JUST_DIRTIED(forknum))
> {
> /*
> * Someone else dirtied it while we were syncing, so we can't mark
> * it clean. Let's give up on this SR and go around again.
> */
> smgr_unlock_sr(sr, flags);
> LWLockRelease(mapping_lock);
> goto retry;
> }
> /* This fork is clean! */
> - flags &= ~SR_SYNCING(forknum);
> flags &= ~SR_DIRTY(forknum);
> }
>
> In smgr_drop_sr, if the sr is SR_SYNCING, it will retry until the sr is not SR_SYNCING. But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it will retry to get another sr, although it has been synced by smgrimmedsync, the flag SR_SYNCING doesn't changed. This might cause dead lock. So I changed your patch as above.
Right. Thanks!
I also added smgrdropdb() to handle DROP DATABASE (the problem you
reported in your previous email).
While tweaking that code, I fixed it so that it uses a condition
variable to wait (instead of the silly sleep loop) when it needs to
drop an SR that is being sync'd. Also, it now releases the mapping
lock while it's doing that, and requires it on retry.
> <2> I changed the code in smgr_drop_sr to avoid some corner cases
> /* Mark it invalid and drop the mapping. */
> smgr_unlock_sr(sr, ~SR_VALID);
> + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
> + sr->nblocks[forknum] = InvalidBlockNumber;
> hash_search_with_hash_value(sr_mapping_table,
> &reln->smgr_rnode,
> hash,
> HASH_REMOVE,
> NULL);
>
> smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't remove the sr_pool. But in smgrnblocks_fast, it just get sr from sr_pool, so I add some codes as above to avoid some corner cases get an unexpected result from smgrnblocks_fast. Is it necessary, I also want some advice from you.
Hmm. I think it might be better to increment sr->generation. That
was already done in the "eviction" code path, where other processes
might still have references to the SR object, and I don't think it's
possible for anyone to access a dropped relation, but I suppose it's
more consistent to do that anyway. Fixed.
Thanks for the review!
Hi Thomas
I want to share a patch with you: I changed the replacement algorithm from FIFO to a simple LRU.
Buzhen
Attachment
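For anyone skimming, the difference can be illustrated roughly like this (an illustration only, not Buzhen's patch; last_used is an assumed per-slot atomic stamp, and a real implementation would likely approximate LRU rather than scan the whole pool on every allocation).

    /* FIFO/clock hand as in the prototype: evict slots strictly in order. */
    static uint32
    smgr_pick_victim_fifo(void)
    {
        return pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
    }

    /* Simple LRU: prefer the slot with the oldest usage stamp. */
    static uint32
    smgr_pick_victim_lru(void)
    {
        uint32      best = 0;
        uint64      best_stamp = PG_UINT64_MAX;

        for (uint32 i = 0; i < (uint32) smgr_shared_relations; i++)
        {
            uint64      stamp = pg_atomic_read_u64(&sr_pool->objects[i].last_used);

            if (stamp < best_stamp)
            {
                best_stamp = stamp;
                best = i;
            }
        }
        return best;
    }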
No change yet, just posting a rebase to keep cfbot happy. One thing I'm wondering about is whether it'd be possible, and if so, a good idea, to make a kind of tiny reusable cache replacement algorithm, something modern, that can be used to kill several birds with one stone (SLRUs, this object pool, ...). Or if the interlocking requirements make it too integrated.
Attachment
On Wed, Jun 16, 2021 at 9:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
No change yet, just posting a rebase to keep cfbot happy.
I think that the proposed feature is pretty cool not only because it fixes some old issues with lseek() performance and reliability, but also because it opens the door to some new features, such as disk size quota management [1]. Cheap up-to-date size tracking is one of the major problems for it.
I've only read the 1st patch so far. Here are some thoughts about it:
- SMgrSharedRelation - what about calling it SMgrShmemRelation. It looks weird too, but "...SharedRelation" makes me think of shared catalog tables.
- We can exclude temp relations from consideration as they are never shared and use RelFileNode instead of RelFileNodeBackend in SMgrSharedRelationMapping.
Not only will it save us some space, but it will also help to avoid situations where temp relations occupy the whole cache (which may easily happen in some workloads).
I assume you were trying to make a generic solution, but in practice, temp rels differ from regular ones and may require different optimizations. Local caching will be enough for them if ever needed, for example, we can leave smgr_cached_nblocks[] for temp relations.
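A minimal sketch of that point, assuming temp relations keep using the existing backend-local smgr_cached_nblocks path and simply never enter the shared pool:

    /* Keyed by RelFileNode only: temp relations are excluded from the pool. */
    typedef struct SMgrSharedRelationMapping
    {
        RelFileNode rnode;      /* no backend ID needed for regular relations */
        int         index;      /* slot in SMgrSharedRelationPool->objects[] */
    } SMgrSharedRelationMapping;

    /* ... and at the shared-pool entry points: */
    if (RelFileNodeBackendIsTemp(reln->smgr_rnode))
        return;                 /* fall back to the backend-local cache */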
> int smgr_shared_relations = 1000;
- How bad would it be to keep the cache for all relations? Let's test the memory overhead on some realistic datasets. I suppose this 1000 default was chosen just for the WIP patch.
I wonder if we can use DSM instead of regular shared memory?
- I'd expect that CREATE DATABASE and ALTER DATABASE SET TABLESPACE require special treatment, because they bypass smgr, just like dropdb. Don't see it in the patch.
[1] https://github.com/greenplum-db/diskquota
--
Best regards,
Lubennikova Anastasia