Thread: Buffer Manager and Contention

Buffer Manager and Contention

From
Simon Riggs
Date:
Thinking about poor performance in the case where the data fits in
RAM, but the working set is too big for shared_buffers, I notice a
couple of things that seem bad in BufMgr, but don't understand why
they are like that.

1. If we need to allocate a buffer to a new block we do this in one
step, while holding both partition locks for the old and the new tag.
Surely it would cause less contention to make the old block/tag
invalid (after flushing), drop the old partition lock and then switch
to the new one? i.e. just hold one mapping partition lock at a time.
Is there a specific reason we do it this way?
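
For reference, here is roughly the current replacement path in
BufferAlloc() as I read it (heavily paraphrased; the initial lookup of
the new tag, pins, usage counts, retries and the same-partition case
are all omitted, so treat this as a sketch rather than the actual
code):

    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    buf = StrategyGetBuffer(...);       /* choose a victim buffer */
    if (victim is dirty)
        FlushBuffer(buf, ...);          /* write the old contents out */

    oldTag = buf->tag;
    oldHash = BufTableHashCode(&oldTag);
    oldPartitionLock = BufMappingPartitionLock(oldHash);

    /* both mapping partition locks are taken (in a deadlock-safe
     * order, not shown) and held across the whole tag switch */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);

    BufTableInsert(&newTag, newHash, buf->buf_id);  /* new entry first */
    /* retag the buffer under its header lock */
    BufTableDelete(&oldTag, oldHash);               /* old entry last */

    LWLockRelease(oldPartitionLock);
    LWLockRelease(newPartitionLock);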

2. Possibly connected to the above, we issue BufTableInsert() BEFORE
we issue BufTableDelete(). That means we need extra entries in the
buffer mapping hash table to allow us to hold both the old and the new
at the same time, for a short period. The way dynahash.c works, we try
to allocate an entry from the freelist and if that doesn't work, we
begin searching ALL the freelists for free entries to steal. So if we
get enough people trying to do virtual I/O at the same time, then we
will hit a "freelist storm" where everybody is chasing the last few
entries. It would make more sense if we could do BufTableDelete()
first, then hold onto the buffer mapping entry rather than add it to
the freelist, so we can use it again when we do BufTableInsert() very
shortly afterwards. That way we wouldn't need to search the freelist
at all. What is the benefit or reason of doing the Delete after the
Insert?
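
To illustrate the storm: entry allocation in dynahash.c
(get_hash_entry(), paraphrased very loosely here, not the literal
code) behaves roughly like this for a partitioned table such as the
buffer mapping table:

    /* sketch only -- spinlocks, nentries accounting and
     * element_alloc() are omitted */
    entry = pop head of freeList[our_freelist_idx];
    if (entry == NULL)
    {
        /*
         * Nothing left on our own freelist, so walk all the other
         * freelists (NUM_FREELISTS of them) trying to steal a spare
         * entry.  With many backends replacing buffers at once,
         * everyone ends up scanning every freelist chasing the last
         * few spare entries.
         */
        for (each other freelist index borrow)
        {
            entry = pop head of freeList[borrow];
            if (entry != NULL)
                break;
        }
    }
    return entry;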

Put that another way: it looks like the BufTable functions are used
in a way that is mismatched with how dynahash is designed.

Thoughts?

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Buffer Manager and Contention

From
Kyotaro Horiguchi
Date:
(I added Yura, as the author of a related patch)

At Thu, 24 Feb 2022 12:58:23 +0000, Simon Riggs <simon.riggs@enterprisedb.com> wrote in 
> Thinking about poor performance in the case where the data fits in
> RAM, but the working set is too big for shared_buffers, I notice a
> couple of things that seem bad in BufMgr, but don't understand why
> they are like that.
> 
> 1. If we need to allocate a buffer to a new block we do this in one
> step, while holding both partition locks for the old and the new tag.
> Surely it would cause less contention to make the old block/tag
> invalid (after flushing), drop the old partition lock and then switch
> to the new one? i.e. just hold one mapping partition lock at a time.
> Is there a specific reason we do it this way?

I'm not sure but I guess the developer wanted to make the operation
atomic.

Yura Sokolov is proposing a patch to separate the two partition locks.

https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru

It seems viable to me, and the benchmarking in that thread showed a
good result.  I'd appreciate your input on that thread.

> 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> we issue BufTableDelete(). That means we need extra entries in the
> buffer mapping hash table to allow us to hold both the old and the new
> at the same time, for a short period. The way dynahash.c works, we try

Yes.

> to allocate an entry from the freelist and if that doesn't work, we
> begin searching ALL the freelists for free entries to steal. So if we
> get enough people trying to do virtual I/O at the same time, then we
> will hit a "freelist storm" where everybody is chasing the last few
> entries. It would make more sense if we could do BufTableDelete()

To reduce that overhead, Yura proposed a surgical modification to
dynahash, but I didn't like that and the latest patch doesn't contain
that part.

> first, then hold onto the buffer mapping entry rather than add it to
> the freelist, so we can use it again when we do BufTableInsert() very
> shortly afterwards. That way we wouldn't need to search the freelist
> at all. What is the benefit or reason of doing the Delete after the
> Insert?

Hmm, something like hash_swap_key() or hash_reinsert_entry()?  That
sounds reasonable.  (Yura's proposal was to take an entry out of the
hash and then attach it again under a new key.)
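
Just to make that concrete, a hypothetical signature (neither function
exists today; the name and arguments are only for illustration) could
look like:

    /*
     * Hypothetical dynahash addition: detach the entry for oldKeyPtr
     * from its bucket and re-attach it under newKeyPtr, reusing the
     * same HASHELEMENT so the freelists are never touched.
     */
    extern void *hash_reinsert_entry(HTAB *hashp,
                                     const void *oldKeyPtr, uint32 oldHashvalue,
                                     const void *newKeyPtr, uint32 newHashvalue);

The BufTableDelete()/BufTableInsert() pair in BufferAlloc() could then
become a single call through a new buf_table.c wrapper.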

> Put that another way: it looks like the BufTable functions are used
> in a way that is mismatched with how dynahash is designed.
> 
> Thoughts?

On the first part, I think Yura's patch works.  On the second point,
Yura has already shown that doing so gives a measurable gain.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Buffer Manager and Contention

From
Kyotaro Horiguchi
Date:
At Fri, 25 Feb 2022 10:20:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> (I added Yura, as the author of a related patch)
> 
> At Thu, 24 Feb 2022 12:58:23 +0000, Simon Riggs <simon.riggs@enterprisedb.com> wrote in 
> > Thinking about poor performance in the case where the data fits in
> > RAM, but the working set is too big for shared_buffers, I notice a
> > couple of things that seem bad in BufMgr, but don't understand why
> > they are like that.
> > 
> > 1. If we need to allocate a buffer to a new block we do this in one
> > step, while holding both partition locks for the old and the new tag.
> > Surely it would cause less contention to make the old block/tag
> > invalid (after flushing), drop the old partition lock and then switch
> > to the new one? i.e. just hold one mapping partition lock at a time.
> > Is there a specific reason we do it this way?
> 
> I'm not sure but I guess the developer wanted to make the operation
> atomic.
> 
> Yura Sokolov is proposing a patch to separate the two partition locks.
> 
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
> 
> It seems viable to me, and the benchmarking in that thread showed a
> good result.  I'd appreciate your input on that thread.
> 
> > 2. Possibly connected to the above, we issue BufTableInsert() BEFORE
> > we issue BufTableDelete(). That means we need extra entries in the
> > buffer mapping hash table to allow us to hold both the old and the new
> > at the same time, for a short period. The way dynahash.c works, we try
> 
> Yes.
> 
> > to allocate an entry from the freelist and if that doesn't work, we
> > begin searching ALL the freelists for free entries to steal. So if we
> > get enough people trying to do virtual I/O at the same time, then we
> > will hit a "freelist storm" where everybody is chasing the last few
> > entries. It would make more sense if we could do BufTableDelete()
> 
> To reduce that overhead, Yura proposed a surgical modification to
> dynahash, but I didn't like that and the latest patch doesn't contain
> that part.
> 
> > first, then hold onto the buffer mapping entry rather than add it to
> > the freelist, so we can use it again when we do BufTableInsert() very
> > shortly afterwards. That way we wouldn't need to search the freelist
> > at all. What is the benefit or reason of doing the Delete after the
> > Insert?
> 
> Hmm, something like hash_swap_key() or hash_reinsert_entry()?  That
> sounds reasonable.  (Yura's proposal was to take an entry out of the
> hash and then attach it again under a new key.)
> 
> Put that another way: it looks like the BufTable functions are used
> in a way that is mismatched with how dynahash is designed.
> > 
> > Thoughts?
> 
> On the first part, I think Yura's patch works.  On the second point,
> Yura has already shown that doing so gives a measurable gain.

On second thought, even if we had a new dynahash API that atomically
replaces a hash key, we would still need to hold both partition locks
to do that, which contradicts our objective.
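
That is, even with such an API the caller would still look roughly
like this (sketch only; hash_reinsert_entry() is the hypothetical
function mentioned upthread):

    oldHash = BufTableHashCode(&oldTag);
    newHash = BufTableHashCode(&newTag);
    oldPartitionLock = BufMappingPartitionLock(oldHash);
    newPartitionLock = BufMappingPartitionLock(newHash);

    /* the old and new tags normally fall into different partitions,
     * so both LWLocks must be held while the entry moves buckets */
    LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    hash_reinsert_entry(SharedBufHash, &oldTag, oldHash,
                        &newTag, newHash);
    LWLockRelease(newPartitionLock);
    LWLockRelease(oldPartitionLock);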

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Buffer Manager and Contention

From
Simon Riggs
Date:
On Fri, 25 Feb 2022 at 01:20, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> Yura Sokolov is proposing a patch to separate the two partition locks.
>
> https://www.postgresql.org/message-id/1edbb61981fe1d99c3f20e3d56d6c88999f4227c.camel%40postgrespro.ru
>
> It seems viable to me, and the benchmarking in that thread showed a
> good result.  I'd appreciate your input on that thread.

Certainly, I will stop this thread and contribute to Yura's thread.

-- 
Simon Riggs                http://www.EnterpriseDB.com/