Thread: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
We have had a PostgreSQL 15.1 server in production at a customer site for some weeks (migrated from an older version) on SuSE SLES 15.

The customer is facing machine lockups, after which the Linux server no longer responds (not even on SSH; only a power-cycle reset brings it back). Shortly before the fault, a lot of messages like these appear in /var/log/messages:

# grep watchdog: /var/log/messages
...
2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]
2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]

Not all of them relate to postmaster, but some do.

The server is essentially idle, has a lot of CPUs and 32 GByte memory. Around 100 PostgreSQL clients connect to the PostgreSQL server, most of them via ESQL/C and on localhost.

Looking around, I found today that WAL archiving was configured wrongly, leading to messages like these (translated here from the German log output):

2024-03-22 13:11:50.838 CET [2630] LOG: archive command failed with exit code 1
2024-03-22 13:11:50.838 CET [2630] DETAIL: The failed archive command was: test ! -f /data/postgresql151/wal_archive/000000010000000000000001 && cp pg_wal/000000010000000000000001 /data/postgresql151/wal_archive/000000010000000000000001
cp: cannot create regular file '/data/postgresql151/wal_archive/000000010000000000000001': No such file or directory
2024-03-22 13:11:51.842 CET [2630] LOG: archive command failed with exit code 1
2024-03-22 13:11:51.842 CET [2630] DETAIL: The failed archive command was: test ! -f /data/postgresql151/wal_archive/000000010000000000000001 && cp pg_wal/000000010000000000000001 /data/postgresql151/wal_archive/000000010000000000000001

The problem was that the directory /data/postgresql151/wal_archive had simply never been created (and this for two weeks in production). Now that it has been created and the backup of the WAL files from there is in place, the lockups have gone away.

Is there any chance that the PostgreSQL server's inability to copy the WAL files could have caused the lockups? Just to make sure that we hit the beast.

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
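The archive command quoted above cannot succeed while its target directory is missing. As a minimal sketch, a check like the following could be run before enabling archiving. The function names check_archive_dir and archive_segment are hypothetical helpers invented for this illustration, not part of PostgreSQL, and the sketch says nothing about whether the failed archiving caused the lockups.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL): verify that the WAL archive
# directory exists and is writable by the user running this script.
check_archive_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir does not exist"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "NOT WRITABLE: $dir"
        return 2
    fi
    echo "OK: $dir"
    return 0
}

# Hypothetical helper mirroring the archive command from the thread:
# refuse to overwrite an existing file, otherwise copy the segment.
# A non-zero exit status corresponds to a failed archive attempt.
archive_segment() {
    src=$1
    dst=$2
    test ! -f "$dst" && cp "$src" "$dst"
}
```

Assuming the path from the thread, `check_archive_dir /data/postgresql151/wal_archive` would have printed the MISSING line during the two weeks in question. On a running server, the failed_count and last_failed_wal columns of the pg_stat_archiver view give a similar hint from the database side.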
Matthias Apitz <guru@unixarea.de> writes:
> We have a PostgreSQL 15.1 server in production at a customer for some
> weeks (migrated from an older version) on SuSE SLES 15.
> The customer is facing machine locks and before the Linux server does
> not respond any more (not even on SSH, only power-cycle reset helps to
> get it back), short before the fault a lot of messages are in
> /var/log/messages of the content:
> # grep watchdog: /var/log/messages
> ...
> 2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]
> 2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
> 2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]

Sounds like failing hardware to me :-(

			regards, tom lane
On Fri, Mar 22, 2024 at 1:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Matthias Apitz <guru@unixarea.de> writes:
> > We have a PostgreSQL 15.1 server in production at a customer for some
> > weeks (migrated from an older version) on SuSE SLES 15.
> > The customer is facing machine locks and before the Linux server does
> > not respond any more (not even on SSH, only power-cycle reset helps to
> > get it back), short before the fault a lot of messages are in
> > /var/log/messages of the content:
> > # grep watchdog: /var/log/messages
> > ...
> > 2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]
> > 2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
> > 2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]
> Sounds like failing hardware to me :-(
Updating to 15.6 would rule out any bugs squashed in the last 15 months.
On Friday, March 22, 2024 at 01:31:43 PM -0400, Ron Johnson wrote:

> On Fri, Mar 22, 2024 at 1:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Matthias Apitz <guru@unixarea.de> writes:
> > > We have a PostgreSQL 15.1 server in production at a customer for some
> > > weeks (migrated from an older version) on SuSE SLES 15.
> > >
> > > The customer is facing machine locks and before the Linux server does
> > > not respond any more (not even on SSH, only power-cycle reset helps to
> > > get it back), short before the fault a lot of messages are in
> > > /var/log/messages of the content:
> > >
> > > # grep watchdog: /var/log/messages
> > > ...
> > > 2024-03-22T13:11:32.056154+01:00 sunrise kernel: [327844.313048][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3069s! [migration/25:166]
> > > 2024-03-22T13:12:28.056244+01:00 sunrise kernel: [327900.310267][ C16] watchdog: BUG: soft lockup - CPU#16 stuck for 3124s! [postmaster:2273]
> > > 2024-03-22T13:12:28.056340+01:00 sunrise kernel: [327900.311052][ C25] watchdog: BUG: soft lockup - CPU#25 stuck for 3121s! [migration/25:166]
> >
> > Sounds like failing hardware to me :-(
>
> Updating to 15.6 would rule out any bugs squashed in the last 15 months.

Yesterday the message appeared more than 300 times:

sunrise:~ # xz -dc /var/log/messages-20240323.xz | grep lockup | wc -l
323

Since the WAL copy and backup have been working correctly, no such messages have appeared.

	matthias

--
Matthias Apitz, ✉ guru@unixarea.de, http://www.unixarea.de/ +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub
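The plain count above does not show which CPUs and tasks were affected. As a rough sketch, the same log could be broken down per CPU and task name. The function summarize_lockups is a hypothetical helper written for this illustration, and it assumes the watchdog line format shown earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on standard input and count
# "soft lockup" watchdog messages per CPU and per task name.
# Assumes lines ending in "CPU#<n> stuck for <secs>s! [<task>:<pid>]".
summarize_lockups() {
    grep 'watchdog: BUG: soft lockup' |
        sed -n 's/.*soft lockup - \(CPU#[0-9]*\) stuck for [0-9]*s! *\[\(.*\):[0-9]*\]$/\1 \2/p' |
        sort | uniq -c | sort -rn
}
```

With the rotated log from the message above, this might be used as `xz -dc /var/log/messages-20240323.xz | summarize_lockups`, which prints one line per CPU and task pair with its count, most frequent first.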