Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS - Mailing list pgsql-general
| From | Ian J Cottee |
| --- | --- |
| Subject | Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS |
| Date | |
| Msg-id | CAL0m=zXg1rYcCdgzJpasCjJputicB1-6JO435ajser-2URpudw@mail.gmail.com |
| In response to | Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS (Vijaykumar Jain <vijaykumarjain.github@gmail.com>) |
| Responses | Re: Random memory related errors on live postgres 14.13 instance on Ubuntu 22.04 LTS |
| List | pgsql-general |
Here's the output of pg_config:
```
BINDIR = /usr/lib/postgresql/14/bin
DOCDIR = /usr/share/doc/postgresql-doc-14
HTMLDIR = /usr/share/doc/postgresql-doc-14
INCLUDEDIR = /usr/include/postgresql
PKGINCLUDEDIR = /usr/include/postgresql
INCLUDEDIR-SERVER = /usr/include/postgresql/14/server
LIBDIR = /usr/lib/x86_64-linux-gnu
PKGLIBDIR = /usr/lib/postgresql/14/lib
LOCALEDIR = /usr/share/locale
MANDIR = /usr/share/postgresql/14/man
SHAREDIR = /usr/share/postgresql/14
SYSCONFDIR = /etc/postgresql-common
PGXS = /usr/lib/postgresql/14/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--mandir=/usr/share/postgresql/14/man' '--docdir=/usr/share/doc/postgresql-doc-14' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/14' '--bindir=/usr/lib/postgresql/14/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--with-extra-version= (Ubuntu 14.13-0ubuntu0.22.04.1)' '--enable-nls' '--enable-thread-safety' '--enable-debug' '--enable-dtrace' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-gssapi' '--with-ldap' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'AWK=mawk' 'MKDIR_P=/bin/mkdir -p' 'PROVE=/usr/bin/prove' 'PYTHON=/usr/bin/python3' 'TAR=/bin/tar' 'XSLTPROC=xsltproc --nonet' 'CFLAGS=-g -O2 -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now' '--enable-tap-tests' '--with-icu' '--with-llvm' 'LLVM_CONFIG=/usr/bin/llvm-config-14' 'CLANG=/usr/bin/clang-14' '--with-lz4' '--with-systemd' '--with-selinux' 'build_alias=x86_64-linux-gnu' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security'
CC = gcc
CPPFLAGS = -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g -g -O2 -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -fno-omit-frame-pointer
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -L/usr/lib/llvm-14/lib -Wl,--as-needed
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lselinux -llz4 -lxslt -lxml2 -lpam -lssl -lcrypto -lgssapi_krb5 -lz -lreadline -lm
VERSION = PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1)
```
On Wed, 30 Oct 2024 at 13:04, Ian J Cottee <ian@cottee.org> wrote:

Hello everyone, I've been using postgres for over 25 years now and never had any major issues that were not caused by my own stupidity. In the last 24 hours, however, I've had a number of issues on one client's server which I assume are a bug in postgres or a possible hardware issue (they are running on a Linode), but I need some clarification and would welcome advice on how to proceed. I will also forward this mail to Linode support to ask them to check for any memory issues they can detect.
This particular Postgres is running on Ubuntu LTS 22.04 and has the following version information:
```
PostgreSQL 14.13 (Ubuntu 14.13-0ubuntu0.22.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
```
The quick summary is that over a 24-hour period I had the following errors appear in the postgres logs at different times, causing the system processes to restart:
- stuck spinlock detected
- free(): corrupted unsorted chunks
- double free or corruption (!prev)
- corrupted size vs. prev_size
- corrupted double-linked list
- *** stack smashing detected ***: terminated
- Segmentation fault
- Are you using a postgresql setup compiled from source? Listing the output of pg_config may give the details.
- Are there any extensions installed? Can you list those extensions?
- If you have access to source packages, can you generate a stack trace from the process that is crashing or, if it dumped a core, a backtrace from the core dump? (A sketch of doing this is shown below.)
- Will it be possible to share the actual logs, both postgresql and kernel, around the incident?
- Do you have access to core dumps which these crashes may have generated? I think ABRT / segmentation faults would generate one.
- Do you collect system stats? Around the time of the crash do you see any abnormal usage of IO, CPU or memory, along with locks held in the postgresql setup etc.?

Here's the more detailed breakdown.
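For reference, here is a minimal sketch of how the extension list and a core-dump backtrace could be gathered on this kind of Debian/Ubuntu packaged install. The database name, core file path, systemd unit/drop-in names and the dbgsym package (which may need the ddebs repository) are assumptions, not something confirmed on this server:
```
# List installed extensions in a database (run as the postgres OS user;
# "yourdb" is a placeholder).
sudo -u postgres psql -d yourdb -c '\dx'

# Allow the cluster to write core files on the next crash (assumes the stock
# systemd-managed Ubuntu cluster). Drop-in file:
#   /etc/systemd/system/postgresql@14-main.service.d/core.conf
#     [Service]
#     LimitCORE=infinity
sudo systemctl daemon-reload
sudo systemctl restart postgresql@14-main    # at a quiet moment

# Install gdb plus debug symbols, then pull a backtrace from an existing core
# (core path is an assumption; on Ubuntu crashes often end up under /var/crash
# via apport).
sudo apt-get install gdb postgresql-14-dbgsym
gdb /usr/lib/postgresql/14/bin/postgres /path/to/core.postgres -ex bt -ex quit
```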
On Monday evening this week, the following event occurred on the server:
```
2024-10-28 18:12:47.145 GMT [575437] xxx@xxx PANIC: stuck spinlock detected at LWLockWaitListLock, ./build/../src/backend/storage/lmgr/lwlock.c:913
```
I think a backtrace here would help show what part of the call stack led to this; this alone does not look like any bug.

Followed by:
```
2024-10-28 18:12:47.249 GMT [1880289] LOG: terminating any other active server processes
2024-10-28 18:12:47.284 GMT [1880289] LOG: all server processes terminated; reinitializing
```
And eventually:
```
2024-10-28 18:12:48.474 GMT [575566] xxx@xxx FATAL: the database system is in recovery mode
2024-10-28 18:12:48.476 GMT [575550] LOG: database system was not properly shut down; automatic recovery in progress
2024-10-28 18:12:48.487 GMT [575550] LOG: redo starts at DD/405E83A8
2024-10-28 18:12:48.487 GMT [575550] LOG: invalid record length at DD/405EF818: wanted 24, got 0
2024-10-28 18:12:48.487 GMT [575550] LOG: redo done at DD/405EF7E0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-10-28 18:12:48.515 GMT [1880289] LOG: database system is ready to accept connections
```
This wasn't noticed by me or any users, as they tend to all be finished by 17:30. However, later:
```
2024-10-28 20:27:15.258 GMT [611459] xxx@xxx LOG: unexpected EOF on client connection with an open transaction
2024-10-28 21:01:05.934 GMT [620373] xxx@xxxx LOG: unexpected EOF on client connection with an open transaction
free(): corrupted unsorted chunks
```
It all seems like memory corruption or some leak ... valgrind? To get more details if it is a leak ...
```
2024-10-28 21:15:02.203 GMT [1880289] LOG: server process (PID 623803) was terminated by signal 6: Aborted
2024-10-28 21:15:02.204 GMT [1880289] LOG: terminating any other active server processes
```
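On the valgrind suggestion above, this is one way a throwaway copy of the cluster could be run under it; it is far too slow for the live instance, and the data directory and suppression file paths below are assumptions:
```
# Run a TEST copy of the cluster under valgrind (never the live one - it is
# orders of magnitude slower). valgrind.supp ships in the PostgreSQL source
# tree under src/tools/, and detection is more thorough when the server is
# built with USE_VALGRIND.
sudo -u postgres valgrind --quiet --trace-children=yes --track-origins=yes \
     --leak-check=no --suppressions=/path/to/valgrind.supp \
     /usr/lib/postgresql/14/bin/postgres -D /path/to/test-datadir
```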
This time it could not recover and I didn’t notice until early the next morning whilst doing some routine checks.
```
2024-10-28 21:15:03.643 GMT [623807] LOG: database system was not properly shut down; automatic recovery in progress
2024-10-28 21:15:03.655 GMT [623807] LOG: redo starts at DD/47366740
2024-10-28 21:15:03.663 GMT [623807] LOG: invalid record length at DD/475452A0: wanted 24, got 0
2024-10-28 21:15:03.663 GMT [623807] LOG: redo done at DD/47545268 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-10-28 21:15:03.682 GMT [623829] xxx@xxx FATAL: the database system is in recovery mode
double free or corruption (!prev)
2024-10-28 21:15:03.832 GMT [1880289] LOG: startup process (PID 623807) was terminated by signal 6: Aborted
2024-10-28 21:15:03.832 GMT [1880289] LOG: aborting startup due to startup process failure
2024-10-28 21:15:03.835 GMT [1880289] LOG: database system is shut down
```
When I noticed it in the morning, it was able to start without an issue. From googling, it appeared to be a memory issue and I wondered if the problem was sorted now that the server process had stopped completely and restarted. The problem was not sorted, although all the above errors were recovered from automatically without any input from me or the client noticing.
```
corrupted size vs. prev_size
2024-10-29 09:55:24.417 GMT [894747] LOG: background worker "parallel worker" (PID 947642) was terminated by signal 6: Aborted
```
```
corrupted double-linked list
2024-10-29 13:14:28.322 GMT [894747] LOG: background worker "parallel worker" (PID 1019071) was terminated by signal 6: Aborted
```
```
*** stack smashing detected ***: terminated
2024-10-28 15:24:30.331 GMT [1880289] LOG: background worker "parallel worker" (PID 528630) was terminated by signal 6: Aborted
```
```
2024-10-28 15:40:26.617 GMT [1880289] LOG: background worker "parallel worker" (PID 533515) was terminated by signal 11: Segmentation fault
2024-10-28 15:40:26.617 GMT [1880289] DETAIL: Failed process was running: SELECT "formula_line".id FROM "formul\
```
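Two of the checks suggested just below can be tried quickly, given that several of the crashes above are in parallel workers. The memtester size and iteration count are assumptions, and an offline memtest86+ pass (or asking Linode to migrate the VM to another host) would be more thorough:
```
# Temporarily disable parallel query cluster-wide, then watch whether the
# parallel-worker crashes stop (revert later with ALTER SYSTEM RESET).
sudo -u postgres psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 0;"
sudo -u postgres psql -c "SELECT pg_reload_conf();"

# Basic userspace RAM test while the box is up (size/iterations are assumptions).
sudo apt-get install memtester
sudo memtester 1024M 3
```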
I don't know why it crashed with SIGABRT instead of SIGKILL if it was indeed a memory leak and not a bug ... so not sure memory overcommitting can be of use here.

How much is the concurrency at peak, and with what work_mem? Any theory of excessive work_mem and too many concurrent processes holding some locks for long? It should not crash even if that is the case, but just asking.

Lastly, is it possible to run memcheck on the machine, just to ensure no memory scares? If this is running on a VM or bare metal, any hardware errors around that time? Most likely it looks like a h/w issue; we used to see things like this on bare metals, which only happened occasionally and then frequently, till we moved away from that setup.

Also, does it happen only when the optimiser picks a plan involving parallel workers for a query? If you set max_parallel_workers_per_gather to 0, to not parallelize anything, do you still see the issue?

Just insights, if not useful, please ignore.

Best regards
Ian Cottee
--