Thread: Re: pgsql: Improve performance of subsystems on top of SLRU

Re: pgsql: Improve performance of subsystems on top of SLRU

From
Alvaro Herrera
Date:
On 2024-Feb-28, Alvaro Herrera wrote:

> Improve performance of subsystems on top of SLRU

Coverity had the following complaint about this commit:

________________________________________________________________________________________________________
*** CID NNNNNNN:  Control flow issues  (DEADCODE)
/srv/coverity/git/pgsql-git/postgresql/src/backend/access/transam/multixact.c: 1375 in GetMultiXactIdMembers()
1369             * and acquire the lock of the new bank.
1370             */
1371            lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
1372            if (lock != prevlock)
1373            {
1374                    if (prevlock != NULL)
>>>     CID 1592913:  Control flow issues  (DEADCODE)
         
 
>>>     Execution cannot reach this statement: "LWLockRelease(prevlock);".
         
 
1375                            LWLockRelease(prevlock);
1376                    LWLockAcquire(lock, LW_EXCLUSIVE);
1377                    prevlock = lock;
1378            }
1379
1380            slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);

And I think it's correct that this is somewhat bogus, or at least
confusing: the only way to have control back here on line 1371 after
having executed once is via the "goto retry" line below; and there we
release "prevlock" and set it to NULL beforehand, so it's impossible for
prevlock to be NULL.  Looking closer I think this code is all confused,
so I suggest to rework it as shown in the attached patch.

I'll have a look at the other places where we use this "prevlock" coding
pattern tomorrow.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Attachment

Re: pgsql: Improve performance of subsystems on top of SLRU

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> And I think it's correct that this is somewhat bogus, or at least
> confusing: the only way to have control back here on line 1371 after
> having executed once is via the "goto retry" line below; and there we
> release "prevlock" and set it to NULL beforehand, so it's impossible for
> prevlock to be NULL.  Looking closer I think this code is all confused,
> so I suggest to rework it as shown in the attached patch.

This is certainly simpler, but I notice that it holds the current
LWLock across the line

     ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));

where the old code did not.  Could the palloc take long enough that
holding the lock is bad?

Also, with this coding the "lock = NULL;" assignment just before
"goto retry" is a dead store.  Not sure if Coverity or other static
analyzers would whine about that.

            regards, tom lane



Re: pgsql: Improve performance of subsystems on top of SLRU

From
Dilip Kumar
Date:
On Mon, Mar 4, 2024 at 1:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Feb-28, Alvaro Herrera wrote:
>
> > Improve performance of subsystems on top of SLRU
>
> Coverity had the following complaint about this commit:
>
> ________________________________________________________________________________________________________
> *** CID NNNNNNN:  Control flow issues  (DEADCODE)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/transam/multixact.c: 1375 in GetMultiXactIdMembers()
> 1369             * and acquire the lock of the new bank.
> 1370             */
> 1371            lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
> 1372            if (lock != prevlock)
> 1373            {
> 1374                    if (prevlock != NULL)
> >>>     CID 1592913:  Control flow issues  (DEADCODE)
> >>>     Execution cannot reach this statement: "LWLockRelease(prevlock);".
> 1375                            LWLockRelease(prevlock);
> 1376                    LWLockAcquire(lock, LW_EXCLUSIVE);
> 1377                    prevlock = lock;
> 1378            }
> 1379
> 1380            slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
>
> And I think it's correct that this is somewhat bogus, or at least
> confusing: the only way to have control back here on line 1371 after
> having executed once is via the "goto retry" line below; and there we
> release "prevlock" and set it to NULL beforehand, so it's impossible for
> prevlock to be NULL.  Looking closer I think this code is all confused,
> so I suggest to rework it as shown in the attached patch.
>
> I'll have a look at the other places where we use this "prevlock" coding
> pattern tomorrow.


+ /* Acquire the bank lock for the page we need. */
  lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
- if (lock != prevlock)
- {
- if (prevlock != NULL)
- LWLockRelease(prevlock);
- LWLockAcquire(lock, LW_EXCLUSIVE);
- prevlock = lock;
- }
+ LWLockAcquire(lock, LW_EXCLUSIVE);

This part is definitely an improvement.

I am not sure about the other changes, I mean that makes the code much
simpler but now we are not releasing the 'MultiXactOffsetCtl' related
bank lock, and later in the following loop, we are comparing that lock
against 'MultiXactMemberCtl' related bank lock. This makes code
simpler because now in the loop we are sure that we are always holding
the lock but I do not like comparing the bank locks for 2 different
SLRUs, although there is no problem as there would not be a common
lock address, anyway, I do not have any strong objection to what you
have done here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: pgsql: Improve performance of subsystems on top of SLRU

From
Alvaro Herrera
Date:
On 2024-Mar-03, Tom Lane wrote:

> This is certainly simpler, but I notice that it holds the current
> LWLock across the line
> 
>      ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
> 
> where the old code did not.  Could the palloc take long enough that
> holding the lock is bad?

Hmm, I guess most of the time it shouldn't be much of a problem (if the
length is small so we can palloc without malloc'ing); but it could be in
the worst case.  But the fix is simple: just release the lock before.
There's no correctness argument for holding it all the way down.  I was
just confused about how the original code worked.

> Also, with this coding the "lock = NULL;" assignment just before
> "goto retry" is a dead store.  Not sure if Coverity or other static
> analyzers would whine about that.

Oh, right.  I removed that.

On 2024-Mar-04, Dilip Kumar wrote:

> I am not sure about the other changes, I mean that makes the code much
> simpler but now we are not releasing the 'MultiXactOffsetCtl' related
> bank lock, and later in the following loop, we are comparing that lock
> against 'MultiXactMemberCtl' related bank lock. This makes code
> simpler because now in the loop we are sure that we are always holding
> the lock but I do not like comparing the bank locks for 2 different
> SLRUs, although there is no problem as there would not be a common
> lock address,

True.  This can be addressed in the same way Tom's first comment is:
just release the lock before entering the second loop, and setting lock
to NULL.  This brings the code to a similar state than before, except
that the additional LWLock * variables are in a tighter scope.  That's
in 0001.


Now, I had a look at the other users of slru.c and noticed in subtrans.c
that StartupSUBTRANS we have some duplicate code that I think could be
rewritten by making the "while" block test the condition at the end
instead of at the start; changed that in 0002.  I'll leave this one for
later, because I want to add some test code for it -- right now it's
pretty much test-uncovered code.


I also looked at slru.c for uses of shared->bank_locks and noticed a
couple that could be made simpler.  That's 0003 here.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)

Attachment

Re: pgsql: Improve performance of subsystems on top of SLRU

From
Alvaro Herrera
Date:
FWIW there's a stupid bug in 0002, which is fixed here.  I'm writing a
simple test for it.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Now I have my system running, not a byte was off the shelf;
It rarely breaks and when it does I fix the code myself.
It's stable, clean and elegant, and lightning fast as well,
And it doesn't cost a nickel, so Bill Gates can go to hell."

Attachment