Thread: many backends hang on MultiXactOffsetSLRU

many backends hang on MultiXactOffsetSLRU

From
James Pang
Date:
Hi experts,
    We have a PostgreSQL v14.8 database where thousands of backends hang on MultiXactOffsetSLRU at the same time. All of these sessions are running the same query ("SELECT ...."), and from the OS and the PostgreSQL slow log we found that all of these queries are stuck in the "BIND" stage.
LOG:  duration: 36631.688 ms  bind S_813: SELECT
LOG:  duration: 36859.786 ms  bind S_1111: SELECT
LOG:  duration: 35868.148 ms  bind <unnamed>: SELECT
LOG:  duration: 36906.471 ms  bind <unnamed>: SELECT
LOG:  duration: 35955.489 ms  bind <unnamed>: SELECT
LOG:  duration: 36833.510 ms  bind <unnamed>: SELECT
LOG:  duration: 36839.535 ms  bind S_1219: SELECT
...

This database hangs on MultiXactOffsetSLRU and MultiXactOffsetBuffer for a long time.
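
For reference, this is roughly how the pile-up looks from pg_stat_activity (a generic monitoring query, not our exact tooling):

    SELECT wait_event_type, wait_event, state, count(*)
      FROM pg_stat_activity
     WHERE wait_event IS NOT NULL
     GROUP BY 1, 2, 3
     ORDER BY count(*) DESC;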

Could you advise why they are hanging in the 'BIND' stage on MultiXactOffsetSLRU?

Thanks,

James 

    


Re: many backends hang on MultiXactOffsetSLRU

From
Amine Tengilimoglu
Date:

     Hi,

           I encountered this before in a project we had migrated to PostgreSQL, and unfortunately it's a situation that completely degrades performance. We identified the cause as savepoints being used excessively and without control. Once the team reduced the number of savepoints, the issue was resolved. However, the documentation also mentions that it can be caused by foreign keys.


  Kind regards..  




Re: many backends hang on MultiXactOffsetSLRU

From
Alvaro Herrera
Date:
On 2024-Sep-10, James Pang wrote:

> Hi experts,
>     We have a PostgreSQL v14.8 database where thousands of backends hang
> on MultiXactOffsetSLRU at the same time. All of these sessions are
> running the same query ("SELECT ...."), and from the OS and the
> PostgreSQL slow log we found that all of these queries are stuck in the
> "BIND" stage.
>  LOG:  duration: 36631.688 ms  bind S_813: SELECT
> LOG:  duration: 36859.786 ms  bind S_1111: SELECT
> ...
> 
> This database hangs on MultiXactOffsetSLRU and MultiXactOffsetBuffer for
> a long time.
> 
> Could you advise why they are hanging in the 'BIND' stage on
> MultiXactOffsetSLRU?

Very likely, it's related to this problem:
[1] https://thebuild.com/blog/2023/01/18/a-foreign-key-pathology-to-avoid/

This is caused by a suboptimal implementation of what we call SLRU,
which multixact uses underneath.  For years, many people dodged this
problem by recompiling with a changed value for
NUM_MULTIXACTOFFSET_BUFFERS in src/include/access/multixact.h (it was
originally 8 buffers, which is very small); you'll need to do that in
all releases up to pg16.  In pg17 this was improved[2][3] and you'll be
able to change the value in postgresql.conf, though since the default
is already larger than the original (16 instead of 8), you may not
need to.

[2] https://pgconf.in/files/presentations/2023/Dilip_Kumar_RareExtremelyChallengingPostgresPerformanceProblems.pdf
[3]
https://www.pgevents.ca/events/pgconfdev2024/schedule/session/53-problem-in-postgresql-slru-and-how-we-are-optimizing-it/
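
For example, on pg17 something along these lines should work (a sketch,
not a recommendation; size it for your own workload, and I believe the
setting can only take effect at server start):

    ALTER SYSTEM SET multixact_offset_buffers = 64;
    -- restart, then verify:
    SHOW multixact_offset_buffers;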

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"



Re: many backends hang on MultiXactOffsetSLRU

From
Alvaro Herrera
Date:
On 2024-Sep-10, Amine Tengilimoglu wrote:

>      Hi,
> 
>            I encountered this before in a project we had migrated to
> PostgreSQL, and unfortunately it's a situation that completely degrades
> performance. We identified the cause as savepoints being used excessively
> and without control. Once the team reduced the number of savepoints, the
> issue was resolved. However, the documentation also mentions that it can
> be caused by foreign keys.

Yeah, it's exactly the same problem; when it comes from savepoints the
issue is pg_subtrans, and when foreign keys are involved, it is
pg_multixact.  Both of those use the SLRU subsystem, which was heavily
modified in pg17 as I mentioned in my reply to James.
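
On pg13 and later, pg_stat_slru can tell you which SLRU is being
hammered, for example:

    -- a high blks_read relative to blks_hit suggests cache thrashing
    SELECT name, blks_hit, blks_read
      FROM pg_stat_slru
     ORDER BY blks_read DESC;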

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)



Re: many backends hang on MultiXactOffsetSLRU

From
James Pang
Date:
    There are no foreign keys, but there is one session that ran transactions against tables using savepoints, one savepoint per SQL statement within the same transaction. The sessions running the "SELECT" query, however, do not use savepoints; there are simply a lot of sessions running the same query that suddenly hang on MultiXact. Even with only one session doing DML with savepoints, all the other query sessions can still see this kind of "MultiXact" waiting, right?
  


Re: many backends hang on MultiXactOffsetSLRU

From
Amine Tengilimoglu
Date:

          I hadn't found a satisfactory explanation of the upper limit related to SLRU before, so this document will be useful. It's a nice development that the relevant limit has been increased in pg17; I hope I don't run into a situation where this limit is exceeded on large systems.


    Kind regards..  



Re: many backends hang on MultiXactOffsetSLRU

From
Alvaro Herrera
Date:
On 2024-Sep-10, James Pang wrote:

>     There are no foreign keys, but there is one session that ran
> transactions against tables using savepoints, one savepoint per SQL
> statement within the same transaction. The sessions running the "SELECT"
> query, however, do not use savepoints; there are simply a lot of sessions
> running the same query that suddenly hang on MultiXact. Even with only
> one session doing DML with savepoints, all the other query sessions can
> still see this kind of "MultiXact" waiting, right?

I think SELECT FOR UPDATE combined with savepoints can create
multixacts, in the absence of foreign keys.
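
A minimal sketch of that pattern (hypothetical table t, not taken from
your workload): the subtransaction's update leaves the row referenced
by two xids, which requires a multixact to store.

    BEGIN;
    SELECT * FROM t WHERE id = 1 FOR UPDATE;  -- locked by the top-level xid
    SAVEPOINT s1;
    UPDATE t SET val = val + 1 WHERE id = 1;  -- the subxact becomes the
                                              -- updater, so xmax becomes
                                              -- a multixact
    COMMIT;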

A query that's waiting doesn't need to have *created* the multixact or
subtrans -- it is sufficient that it's forced to look it up.

If thousands of sessions tried to look up different multixact values
(spread across more than 8 pages), then thrashing of the cache would
result, with catastrophic performance.  This can probably be caused by
some operation that creates one multixact per tuple in a few thousand
tuples.
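
(For scale: if I recall the on-disk layout correctly, each 8 kB offsets
page holds 2048 four-byte entries, so the 8 default buffers cover only
about 16k consecutive multixact IDs; lookups spread wider than that
evict pages constantly.)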

Maybe you could ease this by doing VACUUM on the table (perhaps with a
low multixact freeze age), which might remove some of the multixacts.
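
Something along these lines, as a sketch (fill in your own table name):

    SET vacuum_multixact_freeze_min_age = 0;
    SET vacuum_multixact_freeze_table_age = 0;
    VACUUM some_table;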

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"



Re: many backends hang on MultiXactOffsetSLRU

From
James Pang
Date:
     Most of the query sessions use JDBC connections; the one that uses ODBC issues one savepoint per statement, but it does not run any "select for update; savepoint; update" sequence. Row-lock conflicts make it hard for sessions to touch the same row with UPDATE/DELETE, so I have no idea how that creates a multixact. Can a MultiXact contain an update or delete xid?
   On this server we see thousands of sessions hanging on "MultiXactOffsetSLRU", but they are in the "bind" stage instead of "execute". Why would a backend in "bind" need to access multixact?

Thanks,

James
