Thread: Buffer Manager and Contention
Thinking about poor performance in the case where the data fits in RAM,
but the working set is too big for shared_buffers, I notice a couple of
things that seem bad in BufMgr, but don't understand why they are like
that.

1. If we need to allocate a buffer to a new block we do this in one
step, while holding both partition locks for the old and the new tag.
Surely it would cause less contention to make the old block/tag invalid
(after flushing), drop the old partition lock and then switch to the
new one? i.e. just hold one mapping partition lock at a time. Is there
a specific reason we do it this way?

2. Possibly connected to the above, we issue BufTableInsert() BEFORE we
issue BufTableDelete(). That means we need extra entries in the buffer
mapping hash table to allow us to hold both the old and the new at the
same time, for a short period. The way dynahash.c works, we try to
allocate an entry from the freelist and if that doesn't work, we begin
searching ALL the freelists for free entries to steal. So if we get
enough people trying to do virtual I/O at the same time, then we will
hit a "freelist storm" where everybody is chasing the last few entries.

It would make more sense if we could do BufTableDelete() first, then
hold onto the buffer mapping entry rather than add it to the freelist,
so we can use it again when we do BufTableInsert() very shortly
afterwards. That way we wouldn't need to search the freelist at all.
What is the benefit or reason of doing the Delete after the Insert?

Put that another way, it looks like BufTable functions are used in a
way that mismatches against the way dynahash is designed.

Thoughts?

--
Simon Riggs
http://www.EnterpriseDB.com/
(I added Yura, as the author of a related patch)

At Thu, 24 Feb 2022 12:58:23 +0000, Simon Riggs <simon.riggs@enterprisedb.com> wrote in

> Thinking about poor performance in the case where the data fits in
> RAM, but the working set is too big for shared_buffers, I notice a
> couple of things that seem bad in BufMgr, but don't understand why
> they are like that.
>
> 1. If we need to allocate a buffer to a new block we do this in one
> step, while holding both partition locks for the old and the new tag.
> Surely it would cause less contention to make the old block/tag
> invalid (after flushing), drop the old partition lock and then switch
> to the new one? i.e. just hold one mapping partition lock at a time.
> Is there a specific reason we do it this way?

I'm not sure, but I guess the developer wanted to make the operation
atomic.

Yura Sokolov is proposing a patch to separate the two partition locks:

https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

It seems viable to me, and a benchmark in that thread showed a good
result. I'd appreciate your input on that thread.

> 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> we issue BufTableDelete(). That means we need extra entries in the
> buffer mapping hash table to allow us to hold both the old and the new
> at the same time, for a short period. The way dynahash.c works, we try

Yes.

> to allocate an entry from the freelist and if that doesn't work, we
> begin searching ALL the freelists for free entries to steal. So if we
> get enough people trying to do virtual I/O at the same time, then we
> will hit a "freelist storm" where everybody is chasing the last few
> entries. It would make more sense if we could do BufTableDelete()

To reduce that overhead, Yura proposed a surgical modification of
dynahash, but I didn't like that and the latest patch doesn't contain
that part.

> first, then hold onto the buffer mapping entry rather than add it to
> the freelist, so we can use it again when we do BufTableInsert() very
> shortly afterwards. That way we wouldn't need to search the freelist
> at all. What is the benefit or reason of doing the Delete after the
> Insert?

Hmm, something like hash_swap_key() or hash_reinsert_entry()? That
sounds reasonable. (Yura's proposal was to take an entry out of the
hash table and then attach it with a new key again.)

> Put that another way, it looks like BufTable functions are used in a
> way that mismatches against the way dynahash is designed.
>
> Thoughts?

On the first part, I think Yura's patch works. On the second point,
Yura already showed it gives a certain amount of gain if we do that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

> > first, then hold onto the buffer mapping entry rather than add it to
> > the freelist, so we can use it again when we do BufTableInsert() very
> > shortly afterwards. That way we wouldn't need to search the freelist
> > at all. What is the benefit or reason of doing the Delete after the
> > Insert?
>
> Hmm, something like hash_swap_key() or hash_reinsert_entry()? That
> sounds reasonable. (Yura's proposal was to take an entry out of the
> hash table and then attach it with a new key again.)

On second thought, even if we have a new dynahash API to atomically
replace a hash key, we need to hold two partition locks to do that.
That contradicts our objective.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, 25 Feb 2022 at 01:20, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Yura Sokolov is proposing a patch to separate the two partition locks:
>
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
>
> It seems viable to me, and a benchmark in that thread showed a good
> result. I'd appreciate your input on that thread.

Certainly, I will stop this thread and contribute to Yura's thread.

--
Simon Riggs
http://www.EnterpriseDB.com/