terminate called after throwing an instance of 'std::bad_alloc' - Mailing list pgsql-hackers

A VM which is now running PG13.0 on CentOS 7 crashed:

Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT    '.

Unfortunately, the filesystem wasn't large enough and the corefile is
truncated.

The first badness in our logfiles looks like this; this is the very head of
the logfile:

|[pryzbyj@database7 ~]$ sudo gzip -dc /var/log/postgresql/crash-postgresql-2020-09-30_194000.log.gz |head
|[sudo] password for pryzbyj: 
|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2020-09-30 19:40:09.653 ADT  >LOG:  checkpoint starting: time
|< 2020-09-30 19:40:17.002 ADT  >LOG:  checkpoint complete: wrote 74 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=7.331 s, sync=0.006 s, total=7.348 s; sync files=25, longest=0.002 s, average=0.000 s; distance=295 kB, estimate=4183 kB
 
|< 2020-09-30 19:40:22.642 ADT  >LOG:  server process (PID 17905) was terminated by signal 6: Aborted
|< 2020-09-30 19:40:22.642 ADT  >DETAIL:  Failed process was running: --BEGIN SQL
|        SELECT * FROM

I was able to grep the filesystem to find what looks like the preceding logfile
(which our script had immediately rotated in its attempt to be helpful).

|< 2020-09-30 19:39:09.096 ADT  >LOG:  checkpoint starting: time
|< 2020-09-30 19:39:12.640 ADT  >LOG:  checkpoint complete: wrote 35 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.523 s, sync=0.006 s, total=3.544 s; sync files=14, longest=0.002 s, average=0.000 s; distance=103 kB, estimate=4615 kB
 

This seemed familiar, and I found it had happened 18 months ago on a different
server:

|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2019-02-01 15:36:54.434 CST  >LOG:  server process (PID 13557) was terminated by signal 6: Aborted
|< 2019-02-01 15:36:54.434 CST  >DETAIL:  Failed process was running:
|            SELECT *, '' as n
|            FROM a, b WHERE
|            a.id = b.id AND
|            b.c IS NOT NULL
|            ORDER BY b.c LIMIT 9

I have a log of pg_settings, which shows that this server ran v11.1 until
2019-02-14, when it was upgraded to v11.2.  Since v11.2 included a fix for a
bug I reported involving JIT and wide tables, I probably saw this crash and
dismissed it, even though tables here named "a" and "b" have only ~30 columns
combined.  The query that crashed in 2019 is actually processing a small queue,
and runs every 5 seconds.

The query that crashed today is a "day" level query which runs every 15 minutes,
so it had already run 70-some times today with no issue.

Our DBs use postgis, and today's crash JOINs to the table with geometry
columns, but does not use them at all.

But the 2019 query doesn't even include the geometry table.  I'm not sure if
these are even the same crash, but if they are, I think it's maybe a JIT issue
and not postgis (??)

I've had JIT disabled since 2019, due to no performance benefit for us, but
I've been re-enabling it during upgrades and transitions, and instead disabling
jit_tuple_deforming (since this performs badly for columns with high attnum).
So maybe this will recur before too long.
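
For reference, the setup described above amounts to roughly the following.
This is only a sketch: jit and jit_tuple_deforming are the standard settings,
but showing them as session-level SET commands is an assumption made for
illustration (the same values could equally live in postgresql.conf).

```sql
-- Re-enable JIT compilation (assumed here to be set per session).
SET jit = on;

-- Keep JIT on, but skip JIT-compiled tuple deforming, which performs
-- badly for columns with high attnum.
SET jit_tuple_deforming = off;

-- Confirm what the session is actually using.
SHOW jit;
SHOW jit_tuple_deforming;
```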


